MedEIR: A Specialized Medical Embedding Model for Enhanced Information Retrieval
2025 · Anand Selvadurai, Jasheen Shaik, Girish Chandrasekar, et al.
Abstract
Embedding models have become essential for retrieval-augmented generation (RAG), semantic clustering, and text re-ranking. Despite their growing use, many of these models have notable limitations. For example, Jina fails to capture the semantic content of medical documents, while models such as MiniLM often perform poorly on long-form documents. Domain-adapted models, though specialized, often underperform on general-purpose tasks, which reduces their overall applicability, and general-domain tokenizers often misinterpret medical vocabulary. These limitations, whether in tokenization accuracy, domain comprehension, or handling of long sequences, highlight the need for more versatile solutions. In this work, we present MedEIR, a novel embedding model and tokenizer jointly optimized for both medical and general NLP tasks, incorporating ALiBi-based long-context processing to support sequences of up to 8,192 tokens. MedEIR was pre-trained on only 6 billion tokens, sign
Related papers
- Evaluating Embedding APIs For Information Retrieval (2023)
- MedImageInsight: An Open-source Embedding Model For General Domain Medical Imaging (2024)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- CardioEmbed: Domain-specialized Text Embeddings For Clinical Cardiology (2025)
- RzenEmbed: Towards Comprehensive Multimodal Retrieval (2025)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- Rethinking Hybrid Retrieval: When Small Embeddings And LLM Re-ranking Beat Bigger Models (2025)
- Dewey Long Context Embedding Model: A Technical Report (2025)