Mmdocir: Benchmarking Multimodal Retrieval For Long Documents
2025 Β· Kuicai Dong, Yujing Chang, Xin Deik Goh, et al.
Abstract
Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the later assesses the ability of detecting specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable
Authors
(none)
Tags
Stats
Related papers
- Docmmir: A Framework For Document Multi-modal Information Retrieval (2025)3.46
- M3retrieve: Benchmarking Multimodal Retrieval For Medicine (2025)2.16
- MM-BRIGHT: A Multi-task Multimodal Benchmark For Reasoning-intensive Retrieval (2026)2.60
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)0.00
- MRMR: A Realistic And Expert-level Multidisciplinary Benchmark For Reasoning-intensive Multimodal Retrieval (2025)0.00
- Beyond Global Similarity: Towards Fine-grained, Multi-condition Multimodal Retrieval (2026)2.20
- Visr-bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)0.00
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)2.29