Dense Retrievers Can Fail On Simple Queries: Revealing The Granularity Dilemma Of Embeddings
2025 · Liyan Xu, Zhenlin Su, Mo Yu, et al.
Abstract
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with such fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
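For readers unfamiliar with the setup, the retrieval step that the abstract evaluates can be sketched roughly as follows: each caption and each query phrase is mapped to a vector, and captions are ranked by cosine similarity to the query. This is only an illustrative sketch, not the paper's evaluation code. The function names `cosine` and `rank_passages` are hypothetical, and the three-dimensional vectors are made-up stand-ins for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Hypothetical toy embeddings; a real system would obtain these from a text encoder.
passage_vecs = [
    [0.9, 0.1, 0.0],  # caption 0
    [0.1, 0.9, 0.1],  # caption 1
    [0.0, 0.2, 0.9],  # caption 2
]
query_vec = [0.2, 0.8, 0.0]  # query phrase embedding (assumed)

ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```

Because each caption is compressed into a single vector, a query that targets one small entity or event in the caption only ranks well if that detail remains salient in the vector. That is the failure mode the abstract describes.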
Related papers
- Scaling Laws For Embedding Dimension In Information Retrieval (2026)
- What Are You Token About? Dense Retrieval As Distributions Over The Vocabulary (2022)
- Back To Basics: A Simple Recipe For Improving Out-of-domain Retrieval In Dense Encoders (2023)
- On The Theoretical Limitations Of Embedding-based Retrieval (2025)
- Less Is More: Pre-train A Strong Text Encoder For Dense Retrieval Using A Weak Decoder (2021)
- Lexsembridge: Fine-grained Dense Representation Enhancement Through Token-aware Embedding Augmentation (2025)
- A Fresh Take On Stale Embeddings: Improving Dense Retriever Training With Corrector Networks (2024)
- Query Encoder Distillation Via Embedding Alignment Is A Strong Baseline Method To Boost Dense Retriever Online Efficiency (2023)