Fix Before Search: Benchmarking Agentic Query Visual Pre-processing In Multimodal Retrieval-augmented Generation
2026 · Jiankun Zhang, Shenglai Zeng, Kai Guo, et al.
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect" (suffering from geometric distortions, quality degradation, or semantic ambiguity), leading to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability: visual imperfections severely degrade both retrieval recall and e
Related papers
- MRAG-Bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024)
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- Hv-attack: Hierarchical Visual Attack For Multimodal Retrieval Augmented Generation (2025)
- Pixel-grounded Retrieval For Knowledgeable Large Multimodal Models (2026)
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)
- MG\(^2\)-RAG: Multi-granularity Graph For Multimodal Retrieval-augmented Generation (2026)