Benchmarking Deflection And Hallucination In Large Vision-language Models
2026 · Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, et al.
Abstract
Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., "Sorry, I cannot answer...") when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experim
Related papers
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026) 7.27
- Alleviating Hallucination In Large Vision-language Models With Active Retrieval Augmentation (2024) 7.16
- MRAG-Bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024) 0.00
- Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark (2024) 0.00
- Aligning Vision Models With Human Aesthetics In Retrieval: Benchmarks And Algorithms (2024) 0.00
- Detect, Describe, Discriminate: Moving Beyond VQA For MLLM Evaluation (2024) 0.00
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024) 2.83
- VisR-Bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025) 0.00