Image-text Retrieval Via Preserving Main Semantics Of Vision
2023 · Xu Zhang, Xinzheng Niu, Philippe Fournier-Viger, et al.
Abstract
Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches to this task map images and texts into a common space to create correspondences between the two modalities. However, because an image is semantically rich, its redundant secondary information may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), that helps the model focus on an image's main content. The approach is inspired by the way people typically annotate an image: they describe its main content. We therefore leverage the annotated texts corresponding to an image to help the model capture the image's main content and reduce the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL.
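The common-space matching mentioned in the abstract, and the false-match problem it can suffer from, can be illustrated with a toy sketch. This is not the paper's VSL or its model. It assumes, purely for illustration, hand-made 3-dimensional embeddings in which the first axis stands for an image's main content and the other two for secondary content; the names `cosine`, `rank_texts`, `image_cluttered`, and `image_focused` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_texts(image_vec, text_vecs):
    """Return text indices sorted from most to least similar to the image."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings (assumption): axis 0 = main content,
# axes 1 and 2 = secondary content.
image_cluttered = [0.5, 0.6, 0.6]  # secondary content dominates the embedding
image_focused = [0.9, 0.1, 0.1]    # main content dominates the embedding

texts = [
    [1.0, 0.0, 0.0],  # index 0: caption describing the main content
    [0.0, 0.7, 0.7],  # index 1: caption describing background details
]

cluttered_ranking = rank_texts(image_cluttered, texts)
focused_ranking = rank_texts(image_focused, texts)
print(cluttered_ranking, focused_ranking)
```

In this toy setting, the cluttered embedding ranks the background caption first, while the focused embedding ranks the main-content caption first. That is the kind of shift a loss that emphasizes main content aims to encourage; how VSL achieves it is described in the paper and its repository, not here.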
Related papers
- Beyond Visual Semantics: Exploring The Role Of Scene Text In Image Understanding (2019)
- Visual Semantic Reasoning For Image-text Matching (2019)
- Preserving Semantic Neighborhoods For Robust Cross-modal Retrieval (2020)
- Semantic-preserving Augmentation For Robust Image-text Retrieval (2023)
- Rethinking Benchmarks For Cross-modal Image-text Retrieval (2023)
- Fine-grained Image Classification And Retrieval By Combining Visual And Locally Pooled Textual Features (2020)
- TSVC: Tripartite Learning With Semantic Variation Consistency For Robust Image-text Retrieval (2025)
- Direction-oriented Visual-semantic Embedding Model For Remote Sensing Image-text Retrieval (2023)