Scene Graph Based Fusion Network For Image-text Retrieval
2023 · Guoliang Wang, Yanlei Shang, Yong Chen
Abstract
A critical challenge in image-text retrieval is learning accurate correspondences between images and texts. Most existing methods focus on coarse-grained correspondences based on co-occurrences of semantic objects and fail to distinguish fine-grained local correspondences. In this paper, we propose a novel Scene Graph based Fusion Network (dubbed SGFN), which enhances image and text features through intra- and cross-modal fusion for image-text retrieval. Specifically, we design an intra-modal hierarchical attention fusion that incorporates semantic contexts, such as objects, attributes, and relationships, into the image and text feature vectors via scene graphs, and a cross-modal attention fusion that combines the contextual semantics and local fusion via contextual vectors. Extensive experiments on the public datasets Flickr30K and MSCOCO show that SGFN outperforms several state-of-the-art image-text retrieval methods.
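To make the two fusion stages in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes scene-graph nodes (objects, attributes, relationships) are already embedded as fixed-size vectors, and it substitutes plain scaled dot-product attention for both the intra-modal and the cross-modal fusion. The function names, the residual-sum fusion, the feature dimension, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys):
    """Scaled dot-product attention: each query row becomes a weighted mix of key rows."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    return softmax(scores, axis=-1) @ keys


def intra_modal_fusion(objects, attributes, relations):
    """Hypothetical stand-in for hierarchical fusion within one modality:
    object vectors first absorb attribute context, then relationship context."""
    with_attributes = objects + attend(objects, attributes)
    return with_attributes + attend(with_attributes, relations)


def cross_modal_similarity(image_nodes, text_nodes):
    """Assumed cross-modal step: each text node attends over image nodes,
    and the score is the mean cosine similarity between each text node
    and its attended image context."""
    attended = attend(text_nodes, image_nodes)
    dots = (text_nodes * attended).sum(axis=1)
    norms = np.linalg.norm(text_nodes, axis=1) * np.linalg.norm(attended, axis=1) + 1e-8
    return float((dots / norms).mean())


# Random placeholder embeddings; a real system would use learned features.
rng = np.random.default_rng(0)
dim = 8
image_nodes = intra_modal_fusion(
    rng.normal(size=(5, dim)), rng.normal(size=(4, dim)), rng.normal(size=(3, dim))
)
text_nodes = intra_modal_fusion(
    rng.normal(size=(6, dim)), rng.normal(size=(2, dim)), rng.normal(size=(3, dim))
)
score = cross_modal_similarity(image_nodes, text_nodes)
print(f"image-text similarity: {score:.3f}")
```

In a retrieval setting, a score like this would be computed for every image-text pair and used to rank candidates. The paper's actual attention layers, contextual vectors, and training objective are not specified in the abstract and are not reproduced here.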
Related papers
- A Deep Local And Global Scene-graph Matching For Image-text Retrieval (2021) · 10.74
- Multi-modal Reasoning Graph For Scene-text Based Fine-grained Image Classification And Retrieval (2020) · 11.29
- Visual Semantic Reasoning For Image-text Matching (2019) · 25.23
- Beyond Visual Semantics: Exploring The Role Of Scene Text In Image Understanding (2019) · 9.59
- Modeling Text With Graph Convolutional Network For Cross-modal Information Retrieval (2018) · 11.85
- Scene Text Retrieval Via Joint Text Detection And Similarity Learning (2021) · 16.16
- HGAN: Hierarchical Graph Alignment Network For Image-text Retrieval (2022) · 11.93
- Far-net: Multi-stage Fusion Network With Enhanced Semantic Alignment And Adaptive Reconciliation For Composed Image Retrieval (2025) · 0.00