Learning To Scale Multilingual Representations For Vision-language Tasks
2020 · Andrea Burns, Donghyun Kim, Derry Wijaya, et al.
Abstract
Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-
Related papers
- 12-in-1: Multi-task Vision And Language Representation Learning (2019), 17.85
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024), 0.00
- MULE: Multimodal Universal Language Embedding (2019), 9.03
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026), 0.00
- Language Features Matter: Effective Language Representations For Vision-language Tasks (2019), 8.60
- Multilingual Diversity Improves Vision-language Representations (2024), 2.26
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025), 5.74
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025), 0.00