Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval
2025 · Pranjal A. Chitale, Bishal Santra, Yashoteja Prabhu, et al.
Abstract
Compact dual-encoder models are widely used for retrieval owing to their efficiency and scalability. However, such models often underperform compared to their Large Language Model (LLM)-based retrieval counterparts, likely due to their limited world knowledge. While LLM-based data augmentation has been proposed as a strategy to bridge this performance gap, there is insufficient understanding of its effectiveness and scalability to real-world retrieval problems. Existing research does not systematically explore key factors such as the optimal augmentation scale, the necessity of using large augmentation models, and whether diverse augmentations improve generalization, particularly in out-of-distribution (OOD) settings. This work presents a comprehensive study of the effectiveness of LLM augmentation for retrieval, comprising over 100 distinct experimental settings of retrieval models, augmentation models and augmentation strategies. We find that, while augmentation enhances retrieval pe
Related papers
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) 0.00
- ExpandR: Teaching Dense Retrievers Beyond Queries With LLM Guidance (2025) 3.25
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025) 6.34
- ScalingNote: Scaling Up Retrievers With Large Language Models For Real-world Dense Retrieval (2024) 0.00
- LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing (2025) 1.57
- Scaling Laws For Dense Retrieval (2024) 10.07
- A Comparative Study Of Specialized LLMs As Dense Retrievers (2025) 2.26
- Alleviating Hallucination In Large Vision-language Models With Active Retrieval Augmentation (2024) 7.16