Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval With Cross-encoder Listwise Distillation And Synthetic Data
2025 · Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, et al.
Abstract
We investigate improving the retrieval effectiveness of embedding models through the lens of corpus-specific fine-tuning. Prior work has shown that fine-tuning with queries generated using a dataset's retrieval corpus can boost retrieval effectiveness for the dataset. However, we find, surprisingly, that fine-tuning using the conventional InfoNCE contrastive loss often reduces effectiveness in state-of-the-art models. To overcome this, we revisit cross-encoder listwise distillation and demonstrate that, unlike contrastive learning alone, listwise distillation improves retrieval effectiveness more consistently across multiple datasets. Additionally, we show that synthesizing more training data using diverse query types (such as claims, keywords, and questions) yields greater effectiveness than using any single query type alone, regardless of the query type used in evaluation. Our findings further indicate that synthetic queries offer comparable utility to human-written quer
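To make the two training objectives named in the abstract concrete, here is a minimal NumPy sketch for a single query with one list of candidate passages. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature values, and the choice of KL divergence between softmax-normalized teacher (cross-encoder) and student (embedding model) scores are assumptions, since this excerpt does not give the paper's exact formulation.

```python
import numpy as np


def log_softmax(scores):
    """Numerically stable log-softmax over a 1-D array of scores."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)
    return shifted - np.log(np.sum(np.exp(shifted)))


def info_nce_loss(student_scores, positive_index=0, temperature=0.05):
    """InfoNCE: negative log-probability of the positive passage.

    student_scores holds query-passage similarities for one positive and
    its negatives. The temperature default is an assumed value.
    """
    log_probs = log_softmax(np.asarray(student_scores, dtype=float) / temperature)
    return float(-log_probs[positive_index])


def listwise_distillation_loss(student_scores, teacher_scores,
                               student_temperature=1.0,
                               teacher_temperature=1.0):
    """KL(teacher || student) over one candidate list.

    teacher_scores would come from a cross-encoder reranker and
    student_scores from the embedding model; both temperatures are
    assumed hyperparameters.
    """
    log_p = log_softmax(np.asarray(teacher_scores, dtype=float) / teacher_temperature)
    log_q = log_softmax(np.asarray(student_scores, dtype=float) / student_temperature)
    p = np.exp(log_p)
    return float(np.sum(p * (log_p - log_q)))


if __name__ == "__main__":
    student = [0.62, 0.58, 0.31, 0.12]   # hypothetical embedding similarities
    teacher = [4.1, 3.9, -1.2, -3.0]     # hypothetical cross-encoder logits
    print("InfoNCE:", info_nce_loss(student, positive_index=0))
    print("Listwise KL:", listwise_distillation_loss(student, teacher))
```

The contrast in this sketch is that InfoNCE treats one passage as the only correct answer and every other candidate as equally wrong, whereas the listwise loss asks the student to reproduce the teacher's graded preferences over the whole list, so a near-duplicate of the positive is not pushed away as a hard negative.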
Related papers
- Teaching Dense Retrieval Models To Specialize With Listwise Distillation And LLM Data Augmentation (2025)
- Unsupervised Dense Information Retrieval With Contrastive Learning (2021)
- Unsupervised Dense Retrieval With Conterfactual Contrastive Learning (2024)
- Pre-training Vs. Fine-tuning: A Reproducibility Study On Dense Retrieval Knowledge Acquisition (2025)
- Translate-distill: Learning Cross-language Dense Retrieval By Translation And Distillation (2024)
- More Robust Dense Retrieval With Contrastive Dual Learning (2021)
- Learning Effective Representations For Retrieval Using Self-distillation With Adaptive Relevance Margins (2024)
- Noisy Self-training With Synthetic Queries For Dense Retrieval (2023)