Multilingual Diversity Improves Vision-language Representations
2024 · Thao Nguyen, Matthew Wallingford, Sebastin Santy, et al.
Abstract
Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text pairs and discard many potentially useful non-English samples. Our work questions this practice. Multilingual data is inherently enriching not only because it provides a gateway to learn about culturally salient concepts, but also because it depicts common concepts differently from monolingual data. We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. By translating all multilingual image-text pairs from a raw web crawl to English and re-filtering them, we increase the prevalence of (translate
Related papers
- Multi-head Attention With Diversity For Learning Grounded Multilingual Multimodal Representations (2019)
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025)
- M2-encoder: Advancing Bilingual Image-text Understanding By Large-scale Efficient Pretraining (2024)
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- MASS: Overcoming Language Bias In Image-text Matching (2025)
- 12-in-1: Multi-task Vision And Language Representation Learning (2019)
- Babel-imagenet: Massively Multilingual Evaluation Of Vision-and-language Representations (2023)
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)