MULE: Multimodal Universal Language Embedding
2019 Β· Donghyun Kim, Kuniaki Saito, Kate Saenko, et al.
Abstract
Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of MULE on the bidirectional image-senten
Authors
(none)
Tags
Stats
Related papers
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)4.52
- Vlm2vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)0.00
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021)13.05
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With Mllms (2026)0.00
- MURAL: Multimodal, Multitask Retrieval Across Languages (2021)0.00
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026)0.00
- Mumur : Multilingual Multimodal Universal Retrieval (2022)2.26
- Objembed: Towards Universal Multimodal Object Embeddings (2026)0.00