Aggretriever: A Simple Approach To Aggregate Textual Representations For Robust Dense Passage Retrieval
2022 · Sheng-Chieh Lin, Minghan Li, Jimmy Lin
Abstract
Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not "structurally ready" to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This "lack of readiness" results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In this work, we instead propose to fully exploit knowledge in a pre-trained language model for DPR by aggregating the contextualized token embeddings into a dense vector, which we call agg*. By concatenating vectors from the [CLS] token and agg*, our Aggretriever model substantially improves the effectiveness of dense retrieval models on both in-domain and zero-shot evaluations without introducing substantial training overhead. Code is available at h
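The core idea in the abstract, combining the [CLS] vector with a second vector aggregated from the token embeddings, can be illustrated with a small sketch. The abstract does not say how agg* is computed, so the element-wise max pooling below is only an assumed stand-in, and the function names (`aggregate_tokens`, `build_passage_vector`, `dot`) and toy numbers are hypothetical, not taken from the authors' code.

```python
from typing import List

Vector = List[float]


def aggregate_tokens(token_embeddings: List[Vector]) -> Vector:
    """Collapse per-token vectors into one vector.

    ASSUMPTION: element-wise max pooling is used here purely as a
    placeholder; the abstract does not specify how agg* is built.
    """
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return [max(column) for column in zip(*token_embeddings)]


def build_passage_vector(cls_embedding: Vector,
                         token_embeddings: List[Vector]) -> Vector:
    """Concatenate the [CLS] vector with the aggregated token vector."""
    return list(cls_embedding) + aggregate_tokens(token_embeddings)


def dot(a: Vector, b: Vector) -> float:
    """Inner-product relevance score between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Toy inputs (hypothetical values): a 2-dim [CLS] vector and
# two 3-dim token embeddings.
cls_vec = [0.5, -1.0]
tokens = [[0.1, 0.9, -0.3],
          [0.4, 0.2, 0.7]]

passage_vec = build_passage_vector(cls_vec, tokens)
print(passage_vec)  # [0.5, -1.0, 0.4, 0.9, 0.7]
```

A query encoded the same way can be scored against `passage_vec` with `dot`, which is how single-vector dense retrievers typically rank passages.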
Related papers
- Dense Passage Retrieval: Is It Retrieving? (2024)
- PARM: A Paragraph Aggregation Retrieval Model For Dense Document-to-document Retrieval (2022)
- Pre-training Vs. Fine-tuning: A Reproducibility Study On Dense Retrieval Knowledge Acquisition (2025)
- QAEA-DR: A Unified Text Augmentation Framework For Dense Retrieval (2024)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- Bridging The Training-inference Gap For Dense Phrase Retrieval (2022)
- Investigating Multi-layer Representations For Dense Passage Retrieval (2025)
- Optimized Text Embedding Models And Benchmarks For Amharic Passage Retrieval (2025)