Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs

Abstract

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 of the 15 datasets from the BGE collection reduces the training set size by 2.35× and, surprisingly, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We use LLMs as a simple, cost-effective approach to identify and relabel false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7–1.4 points on BEIR and by 1.7–1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Si
