M3DR: Towards Universal Multilingual Multimodal Document Retrieval
2025 · Adithya S Kolavi, Vyoman Jain
Abstract
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that cap
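The abstract names contrastive training but does not spell out the objective. As a rough illustration only, the sketch below assumes a symmetric InfoNCE loss over a batch of matched (text, document-image) embedding pairs, which is a common choice for this kind of alignment and not necessarily what M3DR uses. The function names `l2_normalize` and `info_nce` and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(text_embs, image_embs, temperature=0.05):
    """Symmetric InfoNCE loss (assumed objective, not confirmed by the paper).

    text_embs[i] and image_embs[i] are treated as a matched pair; every other
    item in the batch serves as a negative.
    """
    t = [l2_normalize(v) for v in text_embs]
    d = [l2_normalize(v) for v in image_embs]
    n = len(t)

    # Temperature-scaled cosine similarity between every text and every image.
    sims = [
        [sum(a * b for a, b in zip(t[i], d[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[i]
        return total / len(rows)

    # Average the text-to-image and image-to-text directions.
    cols = [list(c) for c in zip(*sims)]
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(cols))
```

With this form, a batch whose matched pairs point in the same direction yields a loss near zero, while a batch whose pairs are swapped is penalized heavily. In a multilingual setting the text side would hold queries in different languages for the same document image, which is one way a shared embedding space could be encouraged.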
Related papers
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026)
- DocMMIR: A Framework For Document Multi-modal Information Retrieval (2025)
- MuMUR: Multilingual Multimodal Universal Retrieval (2022)
- MMDocIR: Benchmarking Multimodal Retrieval For Long Documents (2025)
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026)
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)
- M3P: Learning Universal Representations Via Multitask Multilingual Multimodal Pre-training (2020)
- GlobalDoc: A Cross-modal Vision-language Framework For Real-world Document Image Retrieval And Classification (2023)