RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding
2025 · Yinglu Li, Zhiying Lu, Zhihang Liu, et al.
Abstract
Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods treat the entire document as the basic retrieval unit, which introduces substantial irrelevant visual content in two ways: 1) relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) retrieving multiple documents to increase recall introduces further redundant and irrelevant documents. This redundant context distracts the model's attention and degrades performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy that draws on both labeled and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that groups salient patches into complete semantic regions.
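The abstract does not spell out how salient patches are grouped into regions. As a rough illustration of the general idea only, the sketch below thresholds a grid of per-patch relevance scores and merges 4-connected salient patches into bounding boxes. The function name `group_salient_patches`, the fixed `threshold`, and the use of connected components are all assumptions made for this example and should not be read as the paper's actual pipeline.

```python
from collections import deque


def group_salient_patches(scores, threshold=0.5):
    """Group above-threshold patches into 4-connected regions.

    `scores` is a 2D list of per-patch relevance scores (hypothetical input).
    Returns one bounding box per region as (row_min, col_min, row_max, col_max),
    in patch-grid coordinates, with inclusive bounds.
    """
    rows = len(scores)
    cols = len(scores[0]) if rows else 0
    salient = [[s >= threshold for s in row] for row in scores]
    visited = [[False] * cols for _ in range(rows)]
    boxes = []

    for r in range(rows):
        for c in range(cols):
            if not salient[r][c] or visited[r][c]:
                continue
            # Breadth-first search over the connected salient patches.
            queue = deque([(r, c)])
            visited[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and salient[ny][nx] and not visited[ny][nx]):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Toy 4x4 patch grid with two separate salient clusters.
example_scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.7, 0.9],
]
example_boxes = group_salient_patches(example_scores)
```

In this toy grid the two clusters come back as two separate boxes, so only those crops, rather than the whole page, would be passed on to the generator. A real system would likely pick the threshold adaptively and map patch coordinates back to pixel coordinates.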
Related papers
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025)
- MG\(^2\)-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation (2026)
- Are We On The Right Way For Assessing Document Retrieval-augmented Generation? (2025)
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025)
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025)
- Retrieval-augmented Perception: High-resolution Image Perception Meets Visual RAG (2025)