Self-supervised Visual Representations For Cross-modal Retrieval
2019 · Yash Patel, Lluis Gomez, Marçal Rusiñol, et al.
Abstract
Cross-modal retrieval methods have improved significantly in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires tremendous human effort, and their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found in large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that uses the correlations between images and text across the entire set of Wikipedia articles as training data. Our method consists of training a CNN to predict: (1) the semantic context of the article in which an image is most likely to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative vi
Related papers
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) · 13.17
- Semi-supervised Cross-modal Retrieval With Label Prediction (2018) · 11.29
- Self-supervised Learning From Web Data For Multimodal Retrieval (2019) · 8.09
- Preserving Semantic Neighborhoods For Robust Cross-modal Retrieval (2020) · 10.07
- Do Cross Modal Systems Leverage Semantic Relationships? (2019) · 7.16
- Feature Representation Learning For Unsupervised Cross-domain Image Retrieval (2022) · 11.46
- NewsStories: Illustrating Articles With Visual Summaries (2022) · 2.26
- Self-supervised Adversarial Hashing Networks For Cross-modal Retrieval (2018) · 19.56