M3P: Learning Universal Representations Via Multitask Multilingual Multimodal Pre-training
2020 · Minheng Ni, Haoyang Huang, Lin Su, et al.
Abstract
We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training. Our goal is to learn universal representations that map objects occurring in different modalities, or texts expressed in different languages, into a common semantic space. In addition, to explicitly encourage fine-grained alignment between images and non-English languages, we propose Multimodal Code-switched Training (MCT), which combines monolingual pre-training and multimodal pre-training via a code-switch strategy. Experiments are performed on the multilingual image retrieval task across two benchmark datasets, MSCOCO and Multi30K. M3P achieves comparable results for English and new state-of-the-art results for non-English languages.
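The abstract does not spell out how the code-switch strategy operates, so the following is only a minimal sketch of the general idea, assuming word-level substitution: each caption token that has an entry in a bilingual dictionary is swapped for its translation with some probability. The function name `code_switch`, the `prob` parameter, and the toy English-German dictionary are hypothetical illustrations and are not taken from the paper.

```python
import random

# Hypothetical toy English -> German dictionary, for illustration only.
TOY_DICT = {"dog": "Hund", "red": "rot", "ball": "Ball"}


def code_switch(tokens, dictionary, prob, rng):
    """Return a copy of `tokens` in which each token that has a dictionary
    entry is replaced by its translation with probability `prob`.

    Assumption: substitution is word-level and independent per token.
    """
    switched = []
    for tok in tokens:
        translation = dictionary.get(tok.lower())
        if translation is not None and rng.random() < prob:
            switched.append(translation)
        else:
            # Keep the original token when there is no dictionary entry
            # or the random draw does not trigger a swap.
            switched.append(tok)
    return switched


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for repeatability
    caption = "a dog chases a red ball".split()
    print(" ".join(code_switch(caption, TOY_DICT, 0.5, rng)))
```

In a setup like this, the mixed-language caption would stay paired with the original image, so that non-English words appear alongside the same visual content as their English counterparts.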
Related papers
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021) · 13.05
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025) · 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026) · 0.00
- MULE: Multimodal Universal Language Embedding (2019) · 9.03
- Cross-view Language Modeling: Towards Unified Cross-lingual Cross-modal Pre-training (2022) · 8.09
- Multi-head Attention With Diversity For Learning Grounded Multilingual Multimodal Representations (2019) · 7.81
- Multimodal Contrastive Training For Visual Representation Learning (2021) · 16.32