Finding Beans In Burgers: Deep Semantic-visual Embedding With Localization
2018 Β· Martin Engilberge, Louis Chevallier, Patrick PΓ©rez, et al.
Abstract
Several works have proposed to learn a two-path neural network that maps images and texts, respectively, to a same shared Euclidean space where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path which is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space into any input image, delivering state-of-the-art result on the visual grounding of phrases.
Authors
(none)
Tags
Stats
Related papers
- Semi Supervised Phrase Localization In A Bidirectional Caption-image Retrieval Framework (2019)0.00
- Learning To Embed Semantic Similarity For Joint Image-text Retrieval (2022)7.50
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)0.00
- MHSAN: Multi-head Self-attention Network For Visual Semantic Embedding (2020)10.48
- Learning To Learn From Web Data Through Deep Semantic Embeddings (2018)9.03
- Univse: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019)0.00
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)13.17
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018)14.27