LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing
2025 · Yao Zhao, Yantian Ding, Zhiyue Zhang, et al.
Abstract
Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge because pre-trained embeddings degrade on such data and large language model (LLM)-based retrievers are prohibitively expensive to run. While fine-tuning embedding models with data augmentation offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experimental results across multiple domain-specific benchmark datasets demonstrate that LMAR outperforms multiple baselin
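The abstract mentions triplet sampling and contrastive embedding adaptation without giving the objective. As a rough illustration only, the sketch below shows a generic triplet margin loss over cosine distance, which is one common way to train on (anchor, positive, negative) triplets. The function names, the margin value, and the toy vectors are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two 1-D vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (hypothetical helper, not the paper's objective).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy embeddings: the query matches the relevant chunk and is orthogonal
# to the irrelevant one.
query = np.array([1.0, 0.0])
relevant = np.array([1.0, 0.0])
irrelevant = np.array([0.0, 1.0])

well_ordered = triplet_margin_loss(query, relevant, irrelevant)
mis_ordered = triplet_margin_loss(query, irrelevant, relevant)
```

In this toy case the well-ordered triplet incurs no loss, while swapping the positive and negative produces a positive loss that a fine-tuning step would push down.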