Are We On The Right Way For Assessing Document Retrieval-augmented Generation?
2025 · Wenxuan Shen, Mingjia Wang, Yaochen Wang, et al.
Abstract
Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of the document RAG system and use synthetic data with incomplete ground truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that can produce a fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support to address potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehe
Related papers
- RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance (2025) 0.00
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025) 0.00
- Re-ranking The Context For Multimodal Retrieval Augmented Generation (2025) 0.00
- Graph-aware Late Chunking For Retrieval-augmented Generation In Biomedical Literature (2026) 0.00
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025) 0.00
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025) 0.00
- M4-RAG: A Massive-scale Multilingual Multi-cultural Multimodal RAG (2025) 2.00
- REAL-MM-RAG: A Real-world Multi-modal Retrieval Benchmark (2025) 4.52