Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text
2026 · Piyush Singh Pasi
Abstract
Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers, using English text alone, to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, M2M demonstrates robustness across datase
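The core idea in the abstract (fit a small linear map on English text only, then reuse it to project other text into the multimodal space for retrieval) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state the training objective, so closed-form ridge regression stands in for it; the embeddings are synthetic random vectors rather than outputs of real encoders; and the names `fit_linear_map` and `recall_at_k` are hypothetical helpers, not part of any released M2M code.

```python
import numpy as np


def fit_linear_map(src, tgt, ridge=1e-3):
    """Fit weight, bias minimizing ||src @ weight + bias - tgt||^2 + ridge * ||weight||^2.

    Assumption: ridge regression is a stand-in; the paper's actual loss is not given here.
    """
    src_mean, tgt_mean = src.mean(axis=0), tgt.mean(axis=0)
    src_c, tgt_c = src - src_mean, tgt - tgt_mean
    gram = src_c.T @ src_c + ridge * np.eye(src.shape[1])
    weight = np.linalg.solve(gram, src_c.T @ tgt_c)
    bias = tgt_mean - src_mean @ weight
    return weight, bias


def recall_at_k(query_emb, gallery_emb, k=10):
    """Fraction of queries whose match (same row index) ranks in the top k by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())


# Synthetic stand-ins (assumption): a hidden linear relation links the two spaces.
rng = np.random.default_rng(0)
n_train, n_test, d_text, d_mm = 2000, 200, 64, 32
hidden_map = rng.normal(size=(d_text, d_mm))

# "English captions": multilingual-encoder embeddings and their multimodal-space targets.
train_text = rng.normal(size=(n_train, d_text))
train_target = train_text @ hidden_map + 0.1 * rng.normal(size=(n_train, d_mm))

weight, bias = fit_linear_map(train_text, train_target)

# Held-out captions (standing in for unseen languages) and their paired image embeddings.
test_text = rng.normal(size=(n_test, d_text))
test_images = test_text @ hidden_map + 0.1 * rng.normal(size=(n_test, d_mm))

projected = test_text @ weight + bias
score = recall_at_k(projected, test_images, k=10)
print(f"Recall@10 on synthetic data: {score:.3f}")
```

Because the synthetic data is linear by construction, the score it prints says nothing about the reported 94.9% or 89.5% figures; the sketch only shows how a text-only alignment step and a Recall@10 evaluation fit together.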
Related papers
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)
- MURAL: Multimodal, Multitask Retrieval Across Languages (2021)
- MULE: Multimodal Universal Language Embedding (2019)
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)
- M3P: Learning Universal Representations Via Multitask Multilingual Multimodal Pre-training (2020)
- Textme: Bridging Unseen Modalities Through Text Descriptions (2026)
- Multiway-Adapter: Adapting Large-scale Multi-modal Models For Scalable Image-text Retrieval (2023)
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)