UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training
2021 · Mingyang Zhou, Luowei Zhou, Shuohang Wang, et al.
Abstract
Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). We then extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data.
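The MT augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `CaptionRecord` type, the function names, and the `toy_translate` stand-in (which only tags the text instead of calling a real MT system) are all hypothetical and exist only for illustration. The sketch shows how English-only captions might be expanded into per-language captions and then grouped by image, so that the image serves as the pivot linking the languages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CaptionRecord:
    """One caption for one image in one language (hypothetical structure)."""
    image_id: str
    lang: str
    text: str


def augment_with_translations(
    records: List[CaptionRecord],
    translate: Callable[[str, str], str],
    target_langs: List[str],
) -> List[CaptionRecord]:
    """Return the original records plus one translated copy per target language."""
    augmented = list(records)
    for rec in records:
        for lang in target_langs:
            augmented.append(
                CaptionRecord(rec.image_id, lang, translate(rec.text, lang))
            )
    return augmented


def group_by_image(records: List[CaptionRecord]) -> Dict[str, Dict[str, str]]:
    """Group captions by image so each image pivots across its languages."""
    grouped: Dict[str, Dict[str, str]] = {}
    for rec in records:
        grouped.setdefault(rec.image_id, {})[rec.lang] = rec.text
    return grouped


def toy_translate(text: str, lang: str) -> str:
    # Assumption: placeholder for a real MT system; it only tags the text.
    return f"[{lang}] {text}"


english = [
    CaptionRecord("img_001", "en", "a dog runs on the beach"),
    CaptionRecord("img_002", "en", "two people ride bicycles"),
]
multilingual = augment_with_translations(english, toy_translate, ["de", "zh"])
pivoted = group_by_image(multilingual)
```

In this sketch, each entry of `pivoted` holds every caption available for a single image, which is the kind of image-anchored multilingual grouping that the multilingual objectives would draw on.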
Related papers
- Cross-view Language Modeling: Towards Unified Cross-lingual Cross-modal Pre-training (2022)
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019)
- M3P: Learning Universal Representations Via Multitask Multilingual Multimodal Pre-training (2020)
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021)
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- M2-Encoder: Advancing Bilingual Image-text Understanding By Large-scale Efficient Pretraining (2024)
- CL2CM: Improving Cross-lingual Cross-modal Retrieval Via Cross-lingual Knowledge Transfer (2023)
- MULE: Multimodal Universal Language Embedding (2019)