What Are You Token About? Dense Retrieval As Distributions Over The Vocabulary
2022 · Ori Ram, Liat Bezalel, Adi Zicher, et al.
Abstract
Dual encoders are now the dominant architecture for dense retrieval. Yet, we have little understanding of how they represent text, and why this leads to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting projections contain rich semantic information, and draw a connection between them and sparse retrieval. We find that this view can offer an explanation for some of the failure cases of dense retrievers. For example, we observe that the inability of models to handle tail entities is correlated with a tendency of the token distributions to forget some of the tokens of those entities. We leverage this insight and propose a simple way to enrich query and passage representations with lexical information at inference time, and show that this significantly improves performance compared to
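The idea of projecting a dense vector into vocabulary space can be illustrated with a toy sketch. It assumes the projection is a dot product against one embedding per token followed by a softmax; the abstract does not specify the projection, so the paper's actual method may differ. The names `project_to_vocab`, `top_tokens`, and `VOCAB`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math


def project_to_vocab(vector, vocab_embeddings):
    """Map a dense vector to a probability distribution over tokens.

    Assumption: the projection is a dot product with each token's
    embedding, normalized with a softmax.
    """
    logits = {
        token: sum(v * w for v, w in zip(vector, embedding))
        for token, embedding in vocab_embeddings.items()
    }
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits.values())
    exps = {token: math.exp(l - max_logit) for token, l in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def top_tokens(distribution, k=3):
    """Return the k tokens with the highest probability."""
    return sorted(distribution, key=distribution.get, reverse=True)[:k]


# Toy 3-dimensional vocabulary embeddings (hypothetical values).
VOCAB = {
    "paris": [1.0, 0.0, 0.0],
    "france": [0.8, 0.2, 0.0],
    "guitar": [0.0, 1.0, 0.0],
    "the": [0.0, 0.0, 0.1],
}

# A made-up query vector standing in for a dual-encoder output.
query_vec = [2.0, 0.1, 0.0]
dist = project_to_vocab(query_vec, VOCAB)
print(top_tokens(dist, k=2))  # ['paris', 'france']
```

Reading off the highest-probability tokens gives a lexical, sparse-like view of a dense vector. In this view, a token missing from the top of the distribution corresponds to the "forgetting" the abstract describes for tail entities.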
Related papers
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Interpreting Dense Retrieval As Mixture Of Topics (2021) 0.00
- More Robust Dense Retrieval With Contrastive Dual Learning (2021) 11.88
- On The Value Of Behavioral Representations For Dense Retrieval (2022) 0.00
- Learning Diverse Document Representations With Deep Query Interactions For Dense Retrieval (2022) 2.51
- Dense Retrievers Can Fail On Simple Queries: Revealing The Granularity Dilemma Of Embeddings (2025) 2.86
- Analysing The Robustness Of Dual Encoders For Dense Retrieval Against Misspellings (2022) 9.59
- Investigating Multi-layer Representations For Dense Passage Retrieval (2025) 0.00