Multi-head Attention With Diversity For Learning Grounded Multilingual Multimodal Representations
2019 · Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann
Abstract
To promote and better understand multilingual image search, we leverage visual object detection and propose a model with diverse multi-head attention that learns grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects, yielding fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity in order to learn an improved visual-semantic embedding space. We evaluate our model on the German-Image and English-Image matching tasks on the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all three tasks.
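The abstract does not spell out the form of the diversity objective, so the following is only an illustrative sketch of the general idea: several attention heads attend over the same set of items (word tokens or detected objects), and a penalty grows when the heads produce similar attention distributions. It assumes a commonly used regularizer, the squared Frobenius norm of A·Aᵀ − I over the head-by-item attention matrix A, which is not necessarily the objective used in the paper. All function names, feature vectors, and head queries are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, head_queries):
    """Return one attention distribution over the items per head.

    Each head scores every item by a dot product with its own query
    vector (an assumed, simplified scoring function).
    """
    return [
        softmax([sum(q * f for q, f in zip(query, feat)) for feat in features])
        for query in head_queries
    ]


def diversity_penalty(weights):
    """Squared Frobenius norm of (A A^T - I) for the head-by-item matrix A.

    Off-diagonal terms penalize heads that overlap; diagonal terms push
    each head toward a sharply focused distribution.
    """
    num_heads = len(weights)
    penalty = 0.0
    for i in range(num_heads):
        for j in range(num_heads):
            dot = sum(a * b for a, b in zip(weights[i], weights[j]))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


# Toy items standing in for token or object features (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two heads with the same query versus two heads with opposing queries.
identical_heads = attention_weights(features, [[2.0, 0.0], [2.0, 0.0]])
distinct_heads = attention_weights(features, [[4.0, -4.0], [-4.0, 4.0]])

penalty_identical = diversity_penalty(identical_heads)
penalty_distinct = diversity_penalty(distinct_heads)
```

In this toy setup the heads that focus on different items incur a smaller penalty than the two identical heads, which is the behavior a diversity term is meant to reward. In training, such a term would typically be added to the image-sentence matching loss with a weighting coefficient.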
Related papers
- Multilingual Diversity Improves Vision-language Representations (2024)
- Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task (2019)
- M3P: Learning Universal Representations Via Multitask Multilingual Multimodal Pre-training (2020)
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- Bootstrapping Disjoint Datasets For Multilingual Multimodal Representation Learning (2019)
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025)
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021)
- MHSAN: Multi-head Self-attention Network For Visual Semantic Embedding (2020)