Fact Or Facsimile? Evaluating The Factual Robustness Of Modern Retrievers
2025 · Haoyu Wu, Qingcheng Zeng, Kaize Ding
Abstract
Dense retrievers and rerankers are central to retrieval-augmented generation (RAG) pipelines, where accurately retrieving factual information is crucial for maintaining system trustworthiness and defending against RAG poisoning. However, little is known about how much factual competence these components inherit or lose from the large language models (LLMs) they are based on. We pair 12 publicly released embedding checkpoints with their original base LLMs and evaluate both sets on a factuality benchmark. Across every model evaluated, the embedding variants achieve markedly lower accuracy than their bases, with absolute drops ranging from 12 to 43 percentage points (median 28 pts) and typical retriever accuracies collapsing into the 25-35 % band versus the 60-70 % attained by the generative models. This degradation intensifies under a more demanding condition: when the candidate pool per question is expanded from four options to one thousand, the strongest retriever's top-1 accuracy fall
Related papers
- RAR-b: Reasoning As Retrieval Benchmark (2024) · 2.68
- MoR: Better Handling Diverse Queries With A Mixture Of Sparse, Dense, And Human Retrievers (2025) · 2.26
- Making Large Language Models Efficient Dense Retrievers (2025) · 0.00
- Dense Retrievers Can Fail On Simple Queries: Revealing The Granularity Dilemma Of Embeddings (2025) · 2.86
- Optimizing Retrieval-Augmented Generation: Analysis Of Hyperparameter Impact On Performance And Efficiency (2025) · 0.00
- With Argus Eyes: Assessing Retrieval Gaps Via Uncertainty Scoring To Detect And Remedy Retrieval Blind Spots (2026) · 0.00
- Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks (2025) · 0.00
- Hard Negatives, Hard Lessons: Revisiting Training Data Quality For Robust Information Retrieval With LLMs (2025) · 2.26