VISOR: Agentic Visual Retrieval-augmented Generation Via Iterative Search And Over-horizon Reasoning
2026 · Yucheng Shen, Jiulong Wu, Jizhou Huang, et al.
Abstract
Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross
Related papers
- Visrag 2.0: Evidence-guided Multi-image Reasoning In Visual Retrieval-augmented Generation (2025)
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026)
- Robustvisrag: Causality-aware Vision-based Retrieval-augmented Generation Under Visual Degradations (2026)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)
- Leveraging Retrieval-augmented Tags For Large Vision-language Understanding In Complex Scenes (2024)
- Agenticocr: Parsing Only What You Need For Efficient Retrieval-augmented Generation (2026)
- Vdocrag: Retrieval-augmented Generation Over Visually-rich Documents (2025)