Vision-DeepResearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models
2026 · Yu Zeng, Wenxuan Huang, Zhen Fang, et al.
Abstract
Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge of current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, design
Related papers
- MRAG-Bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024) 0.00
- VisR-Bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025) 0.00
- Detect, Describe, Discriminate: Moving Beyond VQA For MLLM Evaluation (2024) 0.00
- Benchmarking Deflection And Hallucination In Large Vision-language Models (2026) 0.00
- Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark (2024) 0.00
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models And Vision Language Models (2024) 8.82
- MIRACL-VISION: A Large, Multilingual, Visual Document Retrieval Benchmark (2025) 0.00
- ViDoRe Benchmark V2: Raising The Bar For Visual Retrieval (2025) 0.00