Multitask Text-to-visual Embedding With Titles And Clickthrough Data
2019 · Pranav Aggarwal, Zhe Lin, Baldo Faieta, et al.
Abstract
Text-visual (also called semantic-visual) embedding is a central problem in vision-language research. It typically involves mapping an image and a text description to a common feature space through a CNN image encoder and an RNN language encoder. In this paper, we propose a new method for learning text-visual embedding using both image titles and click-through data from an image search engine. We also propose a new triplet loss function that models positive awareness of the embedding, and introduce a novel mini-batch-based hard negative sampling approach for better data efficiency in the learning process. Experimental results show that our proposed method outperforms existing methods and is also effective for real-world text-to-visual retrieval.
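For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of a generic triplet loss with in-batch hardest-negative selection for text-image pairs. It is not the paper's positive-aware loss or its sampling scheme, neither of which the abstract specifies. The function names, the margin value of 0.2, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def hardest_negative_triplet_loss(text_emb, image_emb, margin=0.2):
    """Generic triplet loss with the hardest in-batch negative per text query.

    Row i of text_emb and row i of image_emb are assumed to be a matching
    pair; every other image in the batch is treated as a negative.
    The margin of 0.2 is an illustrative assumption, not a value from the paper.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))

    # sim[i, j] is the cosine similarity between text i and image j.
    sim = t @ v.T
    positive = np.diag(sim)

    # Exclude the matching image, then take the most similar remaining image.
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)
    hardest_negative = masked.max(axis=1)

    # Hinge: penalize when the hardest negative is within `margin` of the positive.
    per_query = np.maximum(0.0, margin + hardest_negative - positive)
    return float(per_query.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    texts = rng.normal(size=(4, 8))
    images = texts + 0.1 * rng.normal(size=(4, 8))  # noisy matches
    print(hardest_negative_triplet_loss(texts, images))
```

Selecting only the hardest negative in each mini-batch concentrates the gradient on the most confusable pairs, which is the usual motivation for hard negative sampling; how the paper's approach differs from this baseline is described in the full text rather than the abstract.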
Related papers
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)
- MHSAN: Multi-head Self-attention Network For Visual Semantic Embedding (2020)
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020)
- Beyond Visual Semantics: Exploring The Role Of Scene Text In Image Understanding (2019)
- VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval (2024)
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016)
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020)
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021)