Region-R1: Reinforcing Query-side Region Cropping For Multi-modal Re-ranking
2026 · Chan-Wei Hu, Zhengzhong Tu
Abstract
Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn whether to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performance by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adapta
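The abstract frames region selection as a decision (keep the full image or crop a region) that is rewarded by how well the re-ranker then scores the candidates. The sketch below illustrates that idea with the standard GRPO group-relative advantage, i.e. rewards normalized within one group of sampled actions. It is not the paper's r-GRPO, whose region-aware details are not given here. The action names, the similarity scores, and the binary Recall@1 reward are all hypothetical values chosen for illustration.

```python
from statistics import mean, pstdev


def rerank(scores):
    """Return candidate ids sorted by descending similarity score."""
    return [cid for cid, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]


def recall_at_1_reward(ranked_ids, gold_id):
    """Hypothetical reward: 1.0 if the gold candidate is ranked first, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] == gold_id else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled query-side actions for one image-question query.
# Each action maps to the (made-up) re-ranker scores it would produce.
group = {
    "keep_full_image": {"doc_a": 0.61, "doc_b": 0.64},
    "crop_region_1": {"doc_a": 0.72, "doc_b": 0.55},
    "crop_region_2": {"doc_a": 0.70, "doc_b": 0.58},
    "crop_region_3": {"doc_a": 0.40, "doc_b": 0.66},
}
gold_id = "doc_a"  # assumed ground-truth evidence for this query

rewards = [recall_at_1_reward(rerank(s), gold_id) for s in group.values()]
advantages = group_relative_advantages(rewards)

for action, reward, adv in zip(group, rewards, advantages):
    print(f"{action}: reward={reward:.1f} advantage={adv:+.2f}")
```

In this toy group, the two crops that put the gold candidate first receive positive advantages and the other two actions receive negative ones, so a policy update would shift probability toward those crops for this query.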
Related papers
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025)
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025)
- Discriminative Multi-view Privileged Information Learning For Image Re-ranking (2018)
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025)
- Fix Before Search: Benchmarking Agentic Query Visual Pre-processing In Multimodal Retrieval-augmented Generation (2026)
- VisRAG 2.0: Evidence-guided Multi-image Reasoning In Visual Retrieval-augmented Generation (2025)
- Retrieval-augmented Perception: High-resolution Image Perception Meets Visual RAG (2025)
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)