Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval
2025 · Aarush Sinha
Abstract
Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a provided query and positive passage using Large Language Models (LLMs). We fine-tune DistilBERT using synthetic negatives generated by four state-of-the-art LLMs ranging from 4B to 30B parameters (Qwen3, LLaMA3, Phi4) and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, we find that our generative pipeline consistently underperforms traditional corpus-based mining strategies (BM25 and Cross-Encoder). Furthermore, we observe that scaling the generator model does not monotonically improve retrieval performance, and we find that the 14B parameter model outperforms the 30B model and in some setting
Related papers
- SyNeg: LLM-driven Synthetic Hard-negatives For Dense Retrieval (2024) · 0.00
- Hard Negatives, Hard Lessons: Revisiting Training Data Quality For Robust Information Retrieval With LLMs (2025) · 2.26
- Making Large Language Models Efficient Dense Retrievers (2025) · 0.00
- PromptReps: Prompting Large Language Models To Generate Dense And Sparse Representations For Zero-shot Document Retrieval (2024) · 10.61
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) · 0.00
- ScalingNote: Scaling Up Retrievers With Large Language Models For Real-world Dense Retrieval (2024) · 0.00
- Soft Prompt Tuning For Augmenting Dense Retrieval With Large Language Models (2023) · 9.41
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025) · 6.34