Teaching Dense Retrieval Models To Specialize With Listwise Distillation And LLM Data Augmentation
2025 · Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, et al.
Abstract
While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise distillation and training with a diverse set of queries ranging from natural user searches and factual c
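To make the contrast in the abstract concrete, here is a minimal sketch of the two training signals for one query and a short candidate list. The abstract does not give the exact loss formulations, so everything here is an assumption for illustration: the function names, the temperature values, and the use of KL divergence between softmax-normalized teacher and student scores as the listwise distillation objective (a common choice, not necessarily the authors').

```python
import math

def log_softmax(scores):
    # Numerically stable log-softmax over a list of floats.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def info_nce(student_scores, positive_index=0, temperature=0.05):
    # Contrastive loss: only the single labelled positive is rewarded;
    # every other candidate is treated as an equally bad negative.
    # The temperature value is an illustrative assumption.
    log_probs = log_softmax([s / temperature for s in student_scores])
    return -log_probs[positive_index]

def listwise_kl(student_scores, teacher_scores,
                student_temp=0.05, teacher_temp=1.0):
    # Hypothetical listwise distillation loss: KL(teacher || student)
    # over the whole candidate list, so graded relevance from the
    # cross-encoder teacher is preserved rather than a binary label.
    log_q = log_softmax([s / student_temp for s in student_scores])
    log_p = log_softmax([t / teacher_temp for t in teacher_scores])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Made-up scores: the second passage is nearly as relevant as the first
# according to the teacher, which InfoNCE would ignore.
student = [0.80, 0.75, 0.30]   # retriever similarities (e.g. cosine)
teacher = [4.0, 3.5, -2.0]     # cross-encoder relevance logits
print(info_nce(student), listwise_kl(student, teacher))
```

In this sketch the InfoNCE term pushes the second passage down as a negative even though the teacher scores it almost as high as the positive, while the listwise term only penalizes the student for disagreeing with the teacher's full score distribution.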
Related papers
- Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval With Cross-encoder Listwise Distillation And Synthetic Data (2025) 0.00
- Domain Adaptation For Dense Retrieval Through Self-supervision By Pseudo-relevance Labeling (2022) 0.00
- Curriculum Learning For Dense Retrieval Distillation (2022) 11.49
- ExpandR: Teaching Dense Retrievers Beyond Queries With LLM Guidance (2025) 3.25
- Translate-Distill: Learning Cross-language Dense Retrieval By Translation And Distillation (2024) 8.60
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025) 0.00
- Learning Effective Representations For Retrieval Using Self-distillation With Adaptive Relevance Margins (2024) 2.26
- How To Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval (2023) 11.39