Noisy Self-training With Synthetic Queries For Dense Retrieval
2023 · Fan Jiang, Tom Drummond, Trevor Cohn
Abstract
Although existing neural retrieval models show promising results when training data is abundant, and their performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolving manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Extra analysis on low-resource settings reveals that our method is data efficient and outperforms competitive baselines with as little as 30% of labelled training data. Further extending the framework to reranker training demonstrates that the proposed method is general and yields additional gains on tasks from diverse domains. Source code is available at https://github.
Related papers
- Domain Adaptation For Dense Retrieval Through Self-supervision By Pseudo-relevance Labeling (2022) · 0.00
- Learning Effective Representations For Retrieval Using Self-distillation With Adaptive Relevance Margins (2024) · 2.26
- Noise-robust Dense Retrieval Via Contrastive Alignment Post Training (2023) · 0.00
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025) · 0.00
- Efficiently Teaching An Effective Dense Retriever With Balanced Topic Aware Sampling (2021) · 17.07
- Adversarial Sampling And Training For Semi-supervised Information Retrieval (2018) · 14.43
- Unsupervised Dense Information Retrieval With Contrastive Learning (2021) · 0.00
- How Train-test Leakage Affects Zero-shot Retrieval (2022) · 3.58