CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
2025 · Jiaqi Wang, Xiao Yang, Kai Sun, et al.
Abstract
Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmenta
Related papers
- Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries (2025) 0.00
- M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG (2025) 2.00
- REAL-MM-RAG: A Real-world Multi-modal Retrieval Benchmark (2025) 4.52
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text (2022) 14.66
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models (2024) 0.00
- RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance (2025) 0.00
- M3Retrieve: Benchmarking Multimodal Retrieval for Medicine (2025) 2.16
- MG²-RAG: Multi-granularity Graph for Multimodal Retrieval-Augmented Generation (2026) 0.00