MultiHaystack: Benchmarking Multimodal Retrieval And Reasoning Over 40K Images, Videos, And Documents
2026 · Dannong Xu, Zhongyu Yang, Jun Chen, et al.
Abstract
Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find th
Related papers
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)
- Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark (2024)
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)
- Multimodal Needle In A Haystack: Benchmarking Long-context Capability Of Multimodal Large Language Models (2024)
- MAGNET: A Multi-agent Framework For Finding Audio-visual Needles By Reasoning Over Multi-video Haystacks (2025)
- HIVE: Query, Hypothesize, Verify An LLM Framework For Multimodal Reasoning-intensive Retrieval (2026)
- MR\(^2\)-Bench: Going Beyond Matching To Reasoning In Multimodal Retrieval (2025)
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026)