SyNeg: LLM-driven Synthetic Hard-negatives For Dense Retrieval
2024 · Xiaopeng Li, Xiangyang Li, Hao Zhang, et al.
Abstract
The performance of dense retrieval (DR) is significantly influenced by the quality of negative sampling. Traditional DR methods primarily depend on naive negative sampling techniques or on mining hard negatives through external retrievers and meticulously crafted strategies. However, naive negative sampling often fails to capture the boundaries between positive and negative samples accurately, whereas existing hard negative sampling methods are prone to false negatives, resulting in performance degradation and training instability. Recent advancements in large language models (LLMs) offer an innovative solution to these challenges by generating contextually rich and diverse negative samples. In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples. We first devise a multi-attribute self-reflection prompting strategy to direct LLMs in hard negative sample generation. Then, we implement a hybrid sampling strateg
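To make the role of negative quality concrete, here is a minimal sketch, not the paper's implementation. It assumes a standard softmax contrastive (InfoNCE-style) loss over dot-product scores, which the abstract does not specify. The helper `mix_negatives` and its `synthetic_ratio` parameter are hypothetical illustrations of drawing from both an LLM-generated pool and a retriever-mined pool. The abstract is cut off before it describes the actual hybrid sampling, so this is not the authors' scheme.

```python
import math
import random
from typing import List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> float:
    """Softmax contrastive loss for one query: -log p(positive | query)."""
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[0]


def mix_negatives(
    synthetic: Sequence[str],
    retrieved: Sequence[str],
    k: int,
    synthetic_ratio: float = 0.5,  # hypothetical knob, not from the paper
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw up to k negatives, roughly synthetic_ratio of them from the synthetic pool."""
    rng = rng or random.Random(0)
    n_syn = min(len(synthetic), round(k * synthetic_ratio))
    n_ret = min(len(retrieved), k - n_syn)
    return rng.sample(list(synthetic), n_syn) + rng.sample(list(retrieved), n_ret)


# Toy 2-d embeddings: the hard negative lies close to the positive.
query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_loss = info_nce_loss(query, positive, [[-1.0, 0.0]])
hard_loss = info_nce_loss(query, positive, [[0.9, 0.1]])
```

In this toy setup the hard negative produces a larger loss than the easy one, so it contributes a stronger training signal. The same property makes a false negative (a relevant passage mislabeled as negative) harmful, because the model is pushed hard away from a passage it should retrieve.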
Related papers
- Optimizing Dense Retrieval Model Training With Hard Negatives (2021) 16.34
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025) 0.00
- From Distillation To Hard Negative Sampling: Making Sparse Neural IR Models More Effective (2022) 0.00
- ESANS: Effective And Semantic-aware Negative Sampling For Large-scale Retrieval Systems (2025) 2.26
- Learning To Retrieve: How To Train A Dense Retrieval Model Effectively And Efficiently (2020) 0.00
- Approximate Nearest Neighbor Negative Contrastive Learning For Dense Text Retrieval (2020) 0.00
- Efficiently Teaching An Effective Dense Retriever With Balanced Topic Aware Sampling (2021) 17.07
- DocReRank: Single-page Hard Negative Query Generation For Training Multi-modal RAG Rerankers (2025) 3.58