Cross-view Language Modeling: Towards Unified Cross-lingual Cross-modal Pre-training
2022 Β· Yan Zeng, Wangchunshu Zhou, Ao Luo, et al.
Abstract
In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval data
Authors
(none)
Tags
Stats
Related papers
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021)13.05
- CL2CM: Improving Cross-lingual Cross-modal Retrieval Via Cross-lingual Knowledge Transfer (2023)8.60
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)0.00
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)0.00
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)13.60
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021)2.46
- Unicoder-vl: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019)20.24
- Efficientclip: Efficient Cross-modal Pre-training By Ensemble Confident Learning And Language Modeling (2021)0.00