Retrieval-augmented Perception: High-resolution Image Perception Meets Visual RAG
2025 · Wenbin Wang, Yongcheng Jing, Liang Ding, et al.
Abstract
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea behind HR perception: enhancing the long-context capability of MLLMs, motivated by recent advances in long-context techniques such as retrieval-augmented generation (RAG) for general LLMs. To this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrat
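As a rough illustration of the retrieve-and-fuse idea in the abstract, the sketch below picks the top-K crops from a grid of relevance scores and then compacts them into a smaller grid that keeps their relative positions. This is only one plausible reading of "preserving spatial context": the abstract does not specify how the Spatial-Awareness Layout is built. The function names `select_top_k` and `compact_layout` are hypothetical, and the score values are made up for the example. In practice the scores would come from a retriever comparing the query with each crop. RE-Search, which chooses K, is not modeled here.

```python
def select_top_k(scores, k):
    """Return the (row, col) grid positions of the k highest-scoring crops."""
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    cells.sort(key=lambda t: t[0], reverse=True)
    return {(r, c) for _, r, c in cells[:k]}


def compact_layout(selected):
    """Drop grid rows and columns with no selected crop.

    Assumption: relative order (above/below, left/right) of the kept crops
    is what needs to be preserved. Empty slots are filled with None.
    """
    rows = sorted({r for r, _ in selected})
    cols = sorted({c for _, c in selected})
    row_map = {r: i for i, r in enumerate(rows)}
    col_map = {c: j for j, c in enumerate(cols)}
    layout = [[None] * len(cols) for _ in rows]
    for r, c in selected:
        layout[row_map[r]][col_map[c]] = (r, c)
    return layout


# Made-up relevance scores for a 3x4 grid of image crops.
scores = [
    [0.1, 0.9, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.0],
    [0.2, 0.1, 0.1, 0.8],
]
selected = select_top_k(scores, k=2)
layout = compact_layout(selected)
```

Here the two retained crops end up in a 2x2 grid, with the top-left crop still above and to the left of the bottom-right one, so the fused image is much smaller than the original while the spatial relationship between crops is kept.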
Related papers
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025)
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025)
- MG^2-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation (2026)
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG (2025)
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025)
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)
- Enhancing Image Quality Assessment Ability Of LMMs Via Retrieval-augmented Generation (2026)