V-Retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval
2026 · Dongyang Chen, Chaoyang Wang, Dezhao Su, et al.
Abstract
Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforc
Related papers
- Reasoning-augmented Representations For Multimodal Retrieval (2026) 0.00
- TRACE: Task-adaptive Reasoning And Representation Learning For Universal Multimodal Retrieval (2026) 0.00
- MARVEL: Multimodal Adaptive Reasoning-intensive Expand-rerank And Retrieval (2026) 0.00
- Reasoning Guided Embeddings: Leveraging MLLM Reasoning For Improved Multimodal Retrieval (2025) 0.00
- HIVE: Query, Hypothesize, Verify An LLM Framework For Multimodal Reasoning-intensive Retrieval (2026) 0.00
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) 0.00
- Chain-of-thought Re-ranking For Image Retrieval Tasks (2025) 1.81
- Multihaystack: Benchmarking Multimodal Retrieval And Reasoning Over 40K Images, Videos, And Documents (2026) 0.00