M2-Encoder: Advancing Bilingual Image-text Understanding By Large-scale Efficient Pretraining
2024 · Qingpei Guo, Furong Xu, Hanxiao Zhang, et al.
Abstract
Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) that support multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enabling multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands and yields a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B. The resulting models, dubbed \(M^2\)-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval
Related papers
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- CLIMP: Contrastive Language-image Mamba Pretraining (2026)
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021)
- MLLMs-augmented Visual-language Representation Learning (2023)
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021)