Constructing Phrase-level Semantic Labels To Form Multi-grained Supervision For Image-text Retrieval
2021 · Zhihao Fan, Zhongyu Wei, Zejun Li, et al.
Abstract
Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched from mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside SSAMT, we use different kinds of attention mechanisms to enforce interactions between multi-grained semantic units on both the vision and language sides. For
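The phrase-level label construction described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Triple` and `SceneGraph` types, the `phrase_labels` and `mismatched_triples` helpers, and the hand-written graphs are hypothetical and are not taken from the paper. A real pipeline would obtain the graphs from a text scene graph parser, which the abstract does not name.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Triple:
    """A (subject, relation, object) phrase taken from a text scene graph."""
    subject: str
    relation: str
    obj: str


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical container for the scene graph of one sentence."""
    entities: Tuple[str, ...]
    triples: Tuple[Triple, ...]


def phrase_labels(graphs: Iterable[SceneGraph]) -> Tuple[Set[str], Set[Triple]]:
    """Pool entities and triples over all sentences matched to one image."""
    entities: Set[str] = set()
    triples: Set[Triple] = set()
    for graph in graphs:
        entities.update(graph.entities)
        triples.update(graph.triples)
    return entities, triples


def mismatched_triples(candidate: SceneGraph, triples: Set[Triple]) -> List[Triple]:
    """Return the candidate's triples that are absent from the image's labels."""
    return [t for t in candidate.triples if t not in triples]


# Hand-written graphs standing in for parser output on two matched captions:
# "a man rides a horse" and "a man on a horse in a field".
matched = [
    SceneGraph(("man", "horse"), (Triple("man", "ride", "horse"),)),
    SceneGraph(
        ("man", "horse", "field"),
        (Triple("man", "on", "horse"), Triple("horse", "in", "field")),
    ),
]
image_entities, image_triples = phrase_labels(matched)

# A mismatched caption, "a man rides a bike", differs only at the phrase level.
candidate = SceneGraph(("man", "bike"), (Triple("man", "ride", "bike"),))
mismatched = mismatched_triples(candidate, image_triples)

print(sorted(image_entities))
print(mismatched)
```

In this toy setting the sentence-level label would only say that the candidate caption is a mismatch, while the phrase-level labels locate the mismatch in a single triple. How SSAMT actually consumes such labels is not specified in the text above.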
Related papers
- Semi Supervised Phrase Localization In A Bidirectional Caption-image Retrieval Framework (2019) · 0.00
- Structured Multi-modal Feature Embedding And Alignment For Image-sentence Retrieval (2021) · 12.87
- SAC: Semantic Attention Composition For Text-conditioned Image Retrieval (2020) · 11.49
- Image-text Retrieval Via Preserving Main Semantics Of Vision (2023) · 10.22
- Image-text Retrieval With Binary And Continuous Label Supervision (2022) · 0.00
- TSVC: Tripartite Learning With Semantic Variation Consistency For Robust Image-text Retrieval (2025) · 3.58
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) · 13.17
- Multi-modal Reference Learning For Fine-grained Text-to-image Retrieval (2025) · 6.77