Improving Cross-modal Retrieval With Set Of Diverse Embeddings
2022 · Dongwon Kim, Namyup Kim, Suha Kwak
Abstract
Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: An image often exhibits various situations, and a caption can be coupled with diverse images. Set-based embedding has been studied as a solution to this problem. It seeks to encode a sample into a set of different embedding vectors that capture different semantics of the sample. In this paper, we present a novel set-based embedding method, which is distinct from previous work in two aspects. First, we present a new similarity function called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module to produce a set of embedding vectors that effectively captures diverse semantics of input by the slot attention mechanism. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods includin
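As a rough illustration of what a smoothed set-to-set similarity can look like, the sketch below compares a plain Chamfer similarity (each vector is matched to its single best counterpart in the other set) with a variant in which the hard maximum is replaced by a log-sum-exp. The abstract does not give the formula, so the log-sum-exp form, the cosine base similarity, the scaling parameter `alpha`, and all function names here are assumptions made for illustration, not the authors' confirmed definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chamfer_similarity(set_a, set_b):
    """Plain Chamfer similarity: average best-match cosine, in both directions."""
    a_to_b = sum(max(cosine(x, y) for y in set_b) for x in set_a) / len(set_a)
    b_to_a = sum(max(cosine(x, y) for x in set_a) for y in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


def smooth_chamfer_similarity(set_a, set_b, alpha=16.0):
    """Assumed smoothing: replace each max with (1/alpha) * log-sum-exp.

    Larger alpha brings the result closer to the plain Chamfer similarity;
    smaller alpha lets every pair of vectors contribute to the score.
    """
    a_to_b = sum(
        logsumexp([alpha * cosine(x, y) for y in set_b]) for x in set_a
    ) / (alpha * len(set_a))
    b_to_a = sum(
        logsumexp([alpha * cosine(x, y) for x in set_a]) for y in set_b
    ) / (alpha * len(set_b))
    return 0.5 * (a_to_b + b_to_a)


# Hypothetical toy data: an image encoded as two embeddings, a caption as one.
image_set = [[1.0, 0.0], [0.0, 1.0]]
text_set = [[1.0, 0.0]]

hard = chamfer_similarity(image_set, text_set)
smooth = smooth_chamfer_similarity(image_set, text_set, alpha=16.0)
```

Because log-sum-exp is differentiable in every pairwise similarity, a formulation of this kind passes gradient to all embedding pairs rather than only the best-matching one, which is one plausible reading of the "side effects" the abstract attributes to existing set similarity functions.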
Related papers
- Probabilistic Embeddings For Cross-modal Retrieval (2021)
- Maximal Matching Matters: Preventing Representation Collapse For Robust Cross-modal Retrieval (2025)
- Revisiting Cross Modal Retrieval (2018)
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- Dynamic Visual Semantic Sub-embeddings And Fast Re-ranking (2023)
- Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task (2019)
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020)
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018)