HIVE: Query, Hypothesize, Verify An LLM Framework For Multimodal Reasoning-intensive Retrieval
2026 · Mahmoud Abdalla, Mahmoud Salaheldin Kasem, Mohamed Mahmoud, et al.
Abstract
Multimodal retrieval models fail on reasoning-intensive queries where images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents: the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce HIVE (Hypothesis-driven Iterative Visual Evidence Retrieval), a plug-and-play framework that injects explicit visual-text reasoning into a retriever via LLMs. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates visual and logical gaps observed in top-\(k\) candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803 real-world queries across 29 technical domains), HIVE achieves a new state-of-the-art aggrega
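The four stages listed in the abstract can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `hive_pipeline`, the toy corpus, the token-overlap retriever, and the rule-based stand-ins for the two LLM calls (`toy_synthesize`, `toy_verify`) are all hypothetical names introduced here. In a real system the retriever would be a dense or multimodal model and the two callables would wrap LLM prompts.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interfaces; the abstract does not specify any of these.
Retriever = Callable[[str, int], List[str]]        # (query, k) -> ranked doc ids
Synthesizer = Callable[[str, Sequence[str]], str]  # (query, top-k texts) -> refined query
Verifier = Callable[[str, str], float]             # (query, doc text) -> relevance score


def hive_pipeline(query: str, retrieve: Retriever, synthesize: Synthesizer,
                  verify: Verifier, corpus: Dict[str, str], k: int = 3) -> List[str]:
    # Stage 1: initial retrieval over the corpus.
    first = retrieve(query, k)
    # Stage 2: compensatory query synthesis from the top-k candidates.
    refined = synthesize(query, [corpus[d] for d in first])
    # Stage 3: secondary retrieval with the refined query.
    second = retrieve(refined, k)
    # Stage 4: verify and rerank the union (order-preserving de-duplication).
    union = list(dict.fromkeys(first + second))
    scored = [(verify(query, corpus[d]), d) for d in union]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]


# --- Toy stand-ins (assumptions, for illustration only) ---
CORPUS = {
    "d1": "bar chart of quarterly revenue",
    "d2": "tuning gradient descent learning rate",
    "d3": "loss curve diverges when learning rate is too high",
    "d4": "recipe for sourdough bread",
}


def overlap(a: str, b: str) -> int:
    # Number of distinct lowercase tokens shared by the two strings.
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_retrieve(query: str, k: int) -> List[str]:
    # Rank documents by token overlap; drop documents with no overlap.
    ranked = sorted(CORPUS, key=lambda d: overlap(query, CORPUS[d]), reverse=True)
    return [d for d in ranked if overlap(query, CORPUS[d]) > 0][:k]


def toy_synthesize(query: str, docs: Sequence[str]) -> str:
    # Stand-in for the LLM: appends a fixed description of what the
    # (imagined) query image shows but the candidates do not cover.
    return query + " loss curve diverges"


def toy_verify(query: str, doc: str) -> float:
    # Stand-in for LLM verification: overlap with the query plus visual cues.
    return float(overlap(query + " diverges curve", doc))


if __name__ == "__main__":
    q = "why does training fail with this learning rate"
    print(hive_pipeline(q, toy_retrieve, toy_synthesize, toy_verify, CORPUS, k=1))
```

With `k=1`, the first pass over this toy corpus returns only `d2`; the refined query surfaces `d3`, and the verification step ranks it first. Passing the retriever and both LLM steps in as callables mirrors the plug-and-play framing in the abstract, since the loop does not depend on any particular retriever.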
Related papers
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026) 0.00
- Multihaystack: Benchmarking Multimodal Retrieval And Reasoning Over 40K Images, Videos, And Documents (2026) 0.00
- Reasoning-augmented Representations For Multimodal Retrieval (2026) 0.00
- MARVEL: Multimodal Adaptive Reasoning-intensive Expand-rerank And Retrieval (2026) 0.00
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024) 2.83
- MM-BRIGHT: A Multi-task Multimodal Benchmark For Reasoning-intensive Retrieval (2026) 2.60
- Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark (2024) 0.00
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) 0.00