MURAL: Multimodal, Multitask Retrieval Across Languages
2021 · Aashi Jain, Mandy Guo, Krishna Srinivasan, et al.
Abstract
Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.
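The two-task setup described in the abstract can be illustrated with a small sketch. The abstract does not state the loss, so the code below assumes a bidirectional in-batch softmax contrastive loss, a common choice for dual encoders, applied once to image-caption pairs and once to translation pairs. All function names, the temperature, and the task weights `w_i2t` and `w_t2t` are hypothetical, and the embeddings are taken as given rather than produced by real encoders.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax_nll(scores):
    """Mean negative log-likelihood where row i's positive is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def contrastive_loss(left, right, temperature=0.1):
    """Bidirectional in-batch contrastive loss between two embedding lists.

    left[i] and right[i] form a positive pair; every other item in the
    batch acts as a negative. The temperature value is an assumption.
    """
    left = [l2_normalize(v) for v in left]
    right = [l2_normalize(v) for v in right]
    scores = [[dot(l, r) / temperature for r in right] for l in left]
    transposed = [list(col) for col in zip(*scores)]
    return softmax_nll(scores) + softmax_nll(transposed)


def multitask_loss(image_emb, caption_emb, src_emb, tgt_emb,
                   w_i2t=1.0, w_t2t=1.0):
    """Weighted sum of the image-text and translation-pair matching losses.

    The weights are illustrative, not values taken from the paper.
    """
    return (w_i2t * contrastive_loss(image_emb, caption_emb)
            + w_t2t * contrastive_loss(src_emb, tgt_emb))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[0.9, 0.1], [0.2, 0.8]]
    source = [[1.0, 0.0], [0.0, 1.0]]
    target = [[1.0, 0.1], [0.1, 1.0]]
    print(multitask_loss(images, captions, source, target))
```

In this sketch, caption and translation embeddings would come from the same text encoder, so gradients from the translation task also shape the representations used for image-text matching. That sharing is how text-text pairs could compensate for languages with few image-caption examples.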
Related papers
- MuMUR: Multilingual Multimodal Universal Retrieval (2022) 2.26
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026) 0.00
- MULE: Multimodal Universal Language Embedding (2019) 9.03
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021) 13.05
- Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task (2019) 2.26
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019) 0.00
- MLLMs-augmented Visual-language Representation Learning (2023) 0.00
- Bootstrapping Disjoint Datasets For Multilingual Multimodal Representation Learning (2019) 0.00