Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark
2024 · Tsung-Han Wu, Giscard Biamby, Jerome Quenum, et al.
Abstract
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. Towards a solutio
Related papers
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)
- Multimodal Needle In A Haystack: Benchmarking Long-context Capability Of Multimodal Large Language Models (2024)
- Multihaystack: Benchmarking Multimodal Retrieval And Reasoning Over 40K Images, Videos, And Documents (2026)
- MAGNET: A Multi-agent Framework For Finding Audio-visual Needles By Reasoning Over Multi-video Haystacks (2025)
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026)
- Detect, Describe, Discriminate: Moving Beyond VQA For MLLM Evaluation (2024)
- HIVE: Query, Hypothesize, Verify An LLM Framework For Multimodal Reasoning-intensive Retrieval (2026)
- Benchmarking Deflection And Hallucination In Large Vision-language Models (2026)