Structured Multi-modal Feature Embedding And Alignment For Image-sentence Retrieval
2021 · Xuri Ge, Fuhai Chen, Joemon M. Jose, et al.
Abstract
Current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory due to a lack of consistent representation in both the semantic and structural spaces. In this work, we address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog \(\to\) play \(\to\) ball" in the semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In order to jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SM
Related papers
- Cross-modal Semantic Enhanced Interaction For Image-sentence Retrieval (2022) · 12.33
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025) · 0.00
- Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders (2020) · 19.48
- Towards Cross-modal Text-molecule Retrieval With Better Modality Alignment (2024) · 4.52
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020) · 2.26
- Embedding Arithmetic Of Multimodal Queries For Image Retrieval (2021) · 9.03
- Webly Supervised Joint Embedding For Cross-modal Image-text Retrieval (2018) · 13.17
- A New Fine-grained Alignment Method For Image-text Matching (2023) · 0.00