Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task
2019 · Alireza Mohammadshahi, Remi Lebret, Karl Aberer
Abstract
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embeddings between existing languages in our model. We show that our approach enables better generalization, achieving state-of-the-art performance in the text-to-image and image-to-text retrieval tasks and in the caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k with German and English captions and Microsoft-COCO with English and Japanese captions.
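The abstract does not say which two objective functions are combined, so the following is only an illustrative sketch of the general idea: a margin-based ranking loss that pulls matching image and caption vectors together in a joint space, plus a caption-caption term between the two languages. The function names (`cosine`, `ranking_loss`, `total_loss`), the margin value, and the way the terms are summed are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(xs, ys, margin=0.2):
    """Bidirectional sum-of-hinges ranking loss (a generic, assumed choice).

    xs[i] and ys[i] are treated as a matching pair; every other item
    in the batch acts as a negative example.
    """
    loss = 0.0
    for i in range(len(xs)):
        pos = cosine(xs[i], ys[i])
        for j in range(len(xs)):
            if j == i:
                continue
            # xs[i] scored against a non-matching ys[j]
            loss += max(0.0, margin - pos + cosine(xs[i], ys[j]))
            # ys[i] scored against a non-matching xs[j]
            loss += max(0.0, margin - pos + cosine(xs[j], ys[i]))
    return loss


def total_loss(images, captions_a, captions_b, margin=0.2):
    """Hypothetical combination: one image-caption term per language,
    plus a caption-caption term linking the two languages."""
    return (ranking_loss(images, captions_a, margin)
            + ranking_loss(images, captions_b, margin)
            + ranking_loss(captions_a, captions_b, margin))
```

With perfectly aligned toy vectors the loss is zero, and it grows once matching pairs are less similar than mismatched ones. In a real system the vectors would come from trained image and text encoders, which this sketch leaves out.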
Related papers
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020)
- Order Embeddings And Character-level Convolutions For Multimodal Alignment (2017)
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025)
- Bootstrapping Disjoint Datasets For Multilingual Multimodal Representation Learning (2019)
- Multi-head Attention With Diversity For Learning Grounded Multilingual Multimodal Representations (2019)
- Revisiting Cross Modal Retrieval (2018)
- Objembed: Towards Universal Multimodal Object Embeddings (2026)