Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval
2020 · Hadi Abdi Khojasteh, Ebrahim Ansari, Parvin Razzaghi, et al.
Abstract
This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is challenging because the features and representations of text and images are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network that learns vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized in the way benchmark data are. Furthermore, there can be a higher semantic correlation between the pictures and tweets, in contrast to benchmarks, in which the descriptions are well-organized. Experimental results on MS-COCO benchmark
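The hinge-based triplet ranking mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: cosine similarity as the scoring function, a margin of 0.2, in-batch negatives summed in both retrieval directions, and the hypothetical function names `cosine_similarity` and `triplet_ranking_loss`. The embeddings are plain lists standing in for the outputs of the image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def triplet_ranking_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional hinge-based triplet ranking loss over one batch.

    Assumption: image_embs[i] and text_embs[i] are a matching (positive)
    pair, and every other item in the batch serves as a negative.
    """
    n = len(image_embs)
    sims = [
        [cosine_similarity(image_embs[i], text_embs[j]) for j in range(n)]
        for i in range(n)
    ]
    loss = 0.0
    for i in range(n):
        positive = sims[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own text than with text j.
            loss += max(0.0, margin - positive + sims[i][j])
            # Text i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + sims[j][i])
    return loss


# Matching pairs that are already well separated incur no penalty,
# while swapped pairs are penalized.
aligned = triplet_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = triplet_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is zero once every positive pair outscores every negative by at least the margin, so training pushes matching images and sentences together and mismatched ones apart in the shared space. Variants that keep only the hardest negative per pair are also common; which variant the paper uses is not stated in the abstract.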
Related papers
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- Revisiting Cross Modal Retrieval (2018)
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)
- Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task (2019)
- Learning To Embed Semantic Similarity For Joint Image-text Retrieval (2022)
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017)
- Intra-modal Constraint Loss For Image-text Retrieval (2022)
- Cross-modal Image Retrieval With Deep Mutual Information Maximization (2021)