Hard Negatives, Hard Lessons: Revisiting Training Data Quality For Robust Information Retrieval With LLMs
2025 · Nandan Thakur, Crystina Zhang, Xueguang Ma, et al.
Abstract
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 of the 15 datasets in the BGE collection reduces the training set size by 2.35× and, surprisingly, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to identify and relabel false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7–1.4 points on BEIR and by 1.7–1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Si
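The relabeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout (`query`, `positives`, `negatives`), the function `relabel_false_negatives`, and the `toy_judge` stand-in are all hypothetical. In practice the judge would be an LLM call that decides whether a passage is relevant to the query; a simple token-overlap rule is used here only so the example runs without any model.

```python
import re
from typing import Callable, Dict, List

Example = Dict[str, object]  # hypothetical record: query, positives, negatives


def relabel_false_negatives(
    example: Example, judge: Callable[[str, str], bool]
) -> Example:
    """Move every hard negative that the judge deems relevant into the positives."""
    query = example["query"]
    positives: List[str] = list(example["positives"])
    negatives: List[str] = []
    for passage in example["negatives"]:
        if judge(query, passage):
            positives.append(passage)   # false negative: relabel as positive
        else:
            negatives.append(passage)   # keep as a true hard negative
    return {"query": query, "positives": positives, "negatives": negatives}


def toy_judge(query: str, passage: str) -> bool:
    """Stand-in for an LLM relevance judgment (assumption, not the paper's prompt).

    Returns True when every query token also occurs in the passage.
    """
    query_tokens = set(re.findall(r"\w+", query.lower()))
    passage_tokens = set(re.findall(r"\w+", passage.lower()))
    return query_tokens <= passage_tokens


example = {
    "query": "capital of france",
    "positives": ["Paris is the capital of France."],
    "negatives": [
        "The capital city of France is Paris.",  # mislabeled: actually relevant
        "Berlin is in Germany.",
    ],
}
cleaned = relabel_false_negatives(example, toy_judge)
```

Passing the judge in as a callable keeps the relabeling logic independent of any particular model or API, so the same loop could be reused with a different judging strategy.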
Related papers
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025)
- Optimizing Dense Retrieval Model Training With Hard Negatives (2021)
- Enhancing Retrieval Performance: An Ensemble Approach For Hard Negative Mining (2024)
- NV-Retriever: Improving Text Embedding Models With Effective Hard-negative Mining (2024)
- Learning To Retrieve: How To Train A Dense Retrieval Model Effectively And Efficiently (2020)
- Noisy Self-training With Synthetic Queries For Dense Retrieval (2023)
- SyNeg: LLM-driven Synthetic Hard-negatives For Dense Retrieval (2024)
- BiCA: Effective Biomedical Dense Retrieval With Citation-aware Hard Negatives (2025)