Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval
2018 · Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, et al.
Abstract
Cross-modal retrieval between visual data and natural language descriptions remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are plagued by the issue of training with small-scale datasets covering a limited number of images with ground-truth sentences. Moreover, creating a larger dataset by annotating millions of images with sentences is extremely expensive and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily available web images with noisy annotations to learn a robust image-text joint representation. Specifically, our main idea is to leverage web images and their corresponding tags, along with fully annotated datasets, when training the visual-semantic joint embedding. We propose a two-stage approach for the task that can augment a typical supervise
Related papers
- Learning Joint Representations Of Videos And Sentences With Web Image Search (2016)
- Self-supervised Learning From Web Data For Multimodal Retrieval (2019)
- Learning Robust Visual-semantic Embeddings (2017)
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020)
- Learning To Embed Semantic Similarity For Joint Image-text Retrieval (2022)
- Learning To Learn From Web Data Through Deep Semantic Embeddings (2018)
- Joint Wasserstein Autoencoders For Aligning Multimodal Embeddings (2019)
- Cross-modal Embeddings For Video And Audio Retrieval (2018)