UniDoc-RL: Coarse-to-fine Visual RAG With Hierarchical Actions And Dense Rewards
2026 · Jun Wang, Shuo Tan, Zelong Sun, et al.
Abstract
Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Pol
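The coarse-to-fine action sequence and per-action rewards described in the abstract can be pictured with a small sketch. It is illustrative only: the action names, the `Action` class, the `clamp_crop` and `dense_return` helpers, the unweighted sum, and all reward values are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Action:
    """One step of a hypothetical trajectory (names are illustrative)."""
    level: str     # e.g. "retrieve", "select", or "crop"
    reward: float  # assumed task-aware reward assigned to this step


def clamp_crop(box: Box, width: int, height: int) -> Box:
    """Clamp a requested crop box to the image bounds."""
    left, top, right, bottom = box
    left = max(0, min(left, width))
    right = max(0, min(right, width))
    top = max(0, min(top, height))
    bottom = max(0, min(bottom, height))
    return (left, top, right, bottom)


def dense_return(actions: List[Action], answer_reward: float) -> float:
    """Sum the per-action rewards and the final answer reward.

    A sparse scheme would return only answer_reward; the unweighted
    sum used here is an assumption, not the paper's formula.
    """
    return sum(a.reward for a in actions) + answer_reward


# A coarse-to-fine trajectory: document retrieval, image selection, region crop.
trajectory = [
    Action("retrieve", 0.5),
    Action("select", 1.0),
    Action("crop", 0.25),
]
total = dense_return(trajectory, answer_reward=1.0)
region = clamp_crop((-10, 20, 500, 300), width=400, height=250)
```

In this sketch each action contributes its own reward term, so a poor retrieval or an off-target crop lowers the return even when the final answer happens to be correct.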
Related papers
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025)
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025)
- SimpleDoc: Multi-modal Document Understanding With Dual-cue Page Retrieval And Iterative Refinement (2025)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- VisRAG 2.0: Evidence-guided Multi-image Reasoning In Visual Retrieval-augmented Generation (2025)
- Domain-aware RAG: Mol-enhanced RL For Efficient Training And Scalable Retrieval (2025)
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)