ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings
2025 · Ali Shiraee Kasmaee, Mohammad Khodadad, Mehdi Astaraki, et al.
Abstract
Retrieval-Augmented Generation (RAG) systems in chemistry heavily depend on accurate and relevant retrieval of chemical literature. However, general-purpose text embedding models frequently fail to adequately represent complex chemical terminologies, resulting in suboptimal retrieval quality. Specialized embedding models tailored to chemical literature retrieval have not yet been developed, leaving a substantial performance gap. To address this challenge, we introduce ChEmbed, a domain-adapted family of text embedding models fine-tuned on a dataset comprising chemistry-specific text from the PubChem, Semantic Scholar, and ChemRxiv corpora. To create effective training data, we employ large language models to synthetically generate queries, resulting in approximately 1.7 million high-quality query-passage pairs. Additionally, we augment the tokenizer by adding 900 chemically specialized tokens to previously unused slots, which significantly reduces the fragmentation of chemical entities
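The tokenizer augmentation described in the abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes a WordPiece-style vocabulary with `[unused…]` placeholder entries, and the vocabulary contents and the helper names `tokenize` and `fill_unused_slots` are hypothetical. It shows how reassigning placeholder slots to domain tokens keeps the vocabulary size fixed while reducing how many pieces a chemical term is split into.

```python
def tokenize(word, vocab):
    """Greedy longest-match-first subword split (toy WordPiece-style tokenizer)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Continuation pieces carry the "##" prefix, as in WordPiece.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No vocabulary entry matches at this position.
            return ["[UNK]"]
        start = end
    return pieces


def fill_unused_slots(vocab, new_tokens):
    """Return a copy of vocab (token -> id) with placeholder slots reassigned.

    Assumption: placeholder entries are named "[unused0]", "[unused1]", ...
    The ids are reused, so the vocabulary (and embedding matrix) size is unchanged.
    """
    updated = dict(vocab)
    free_slots = sorted((i, t) for t, i in vocab.items() if t.startswith("[unused"))
    candidates = [t for t in new_tokens if t not in vocab]
    for (slot_id, placeholder), token in zip(free_slots, candidates):
        del updated[placeholder]
        updated[token] = slot_id
    return updated


# Hypothetical toy vocabulary with two free placeholder slots.
BASE_VOCAB = {
    "[UNK]": 0, "[unused0]": 1, "[unused1]": 2,
    "ben": 3, "##zo": 4, "##yl": 5,
}

before = tokenize("benzoyl", BASE_VOCAB)
augmented = fill_unused_slots(BASE_VOCAB, ["benzoyl"])
after = tokenize("benzoyl", augmented)
print(before, "->", after)
```

Because the new tokens take over existing ids, no rows need to be added to the embedding matrix in this setup; the reassigned rows would still need to be trained during fine-tuning before they carry useful meaning.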