MHSAN: Multi-head Self-attention Network For Visual Semantic Embedding
2020 · Geondo Park, Chihye Han, Wonjun Yoon, et al.
Abstract
Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to express visual and textual data properly by accounting for their intricate relationship. While previous studies have made considerable progress by encoding visual and textual data into a joint space where similar concepts lie close together, they often represent each input by a single vector, ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network that captures various components of visual and textual data by attending to important parts of the data. Our approach achieves new state-of-the-art results in image-text retrieval tasks on the MS-COCO and Flickr30K datasets. Through the visualization of the attention maps that capture distinct semantic components at multiple positions in the image and the tex
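The abstract describes attending to several parts of an input instead of collapsing it into a single vector, but it does not give the attention formula. The following is a minimal sketch of one common way to do this, assuming a structured self-attentive pooling form in which each head computes softmax weights over local features (image regions or word vectors) and produces its own weighted sum. The function names, the weight matrices `w1` and `w2`, and all dimensions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def multi_head_attention_pool(features, w1, w2):
    """Pool n local features into one vector per attention head.

    features: n vectors of length d (e.g. image regions or word vectors)
    w1:       d_a x d projection (assumed form, not from the paper)
    w2:       h x d_a matrix, one row per head (assumed form)
    Returns (attn, pooled): h attention maps of length n, and h pooled
    vectors of length d.
    """
    # Hidden representation of every local feature: tanh(W1 f_i).
    hidden = [[math.tanh(v) for v in matvec(w1, f)] for f in features]
    # One unnormalised score per head for every local feature: W2 hidden_i.
    scores = [matvec(w2, h_vec) for h_vec in hidden]

    dim = len(features[0])
    attn, pooled = [], []
    for k in range(len(w2)):
        # Each head normalises its own scores over the n positions.
        weights = softmax([s[k] for s in scores])
        attn.append(weights)
        # Weighted sum of the local features under this head's attention.
        pooled.append(
            [sum(a * f[j] for a, f in zip(weights, features)) for j in range(dim)]
        )
    return attn, pooled


# Toy example with random inputs; sizes are arbitrary.
random.seed(0)
n, d, d_a, h = 6, 8, 5, 3
features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w1 = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d_a)]
w2 = [[random.gauss(0, 0.1) for _ in range(d_a)] for _ in range(h)]

attn, pooled = multi_head_attention_pool(features, w1, w2)
# Concatenating the per-head vectors gives one multi-component embedding.
embedding = [x for head in pooled for x in head]
```

Each row of `attn` is a distribution over input positions, so it can be rendered as an attention map of the kind the abstract mentions visualizing. How the per-head vectors are combined and trained in the actual model is not stated in the abstract; concatenation is used here only as a placeholder.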
Related papers
- Multitask Text-to-visual Embedding With Titles And Clickthrough Data (2019) · 0.00
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020) · 2.26
- Polysemous Visual-semantic Embedding For Cross-modal Retrieval (2019) · 17.70
- UniVSE: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019) · 0.00
- Conditional Cross Attention Network For Multi-space Embedding Without Entanglement In Only A SINGLE Network (2023) · 3.58
- Learning Robust Visual-semantic Embeddings (2017) · 15.22
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) · 13.17
- Combating Visual Neglect And Semantic Drift In Large Multimodal Models For Enhanced Cross-modal Retrieval (2026) · 0.00