DocMMIR: A Framework For Document Multi-modal Information Retrieval
2025 · Zirui Li, Siwei Wu, Yizhi Li, et al.
Abstract
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstr
Related papers
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval (2024)
- RETLLM: Training And Data-free MLLMs For Multimodal Information Retrieval (2026)
- MMDocIR: Benchmarking Multimodal Retrieval For Long Documents (2025)
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026)
- MIRACL-VISION: A Large, Multilingual, Visual Document Retrieval Benchmark (2025)
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025)