Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders
2020 · Nicola Messina, Giuseppe Amato, Andrea Esuli, et al.
Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links inval
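The word-region matching described in the abstract can be illustrated with a minimal sketch. It assumes that each image is already encoded as a list of region feature vectors and each sentence as a list of word feature vectors, produced by two independent pipelines. The function names, the cosine similarity, and the pooling rule (best-matching region per word, summed over words) are illustrative assumptions; the abstract does not specify how TERAN pools the alignments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_matrix(regions, words):
    """Region-by-word similarity matrix: entry [i][j] compares region i with word j."""
    return [[cosine(r, w) for w in words] for r in regions]


def global_similarity(regions, words):
    """Pool fine-grained alignments into one image-sentence score.

    Assumed pooling: for each word take its best-matching region,
    then sum over words. Requires at least one region.
    """
    matrix = alignment_matrix(regions, words)
    return sum(max(row[j] for row in matrix) for j in range(len(words)))


def rank_images(images, words):
    """Rank image names by decreasing similarity to the query sentence.

    `images` maps a (hypothetical) image name to its region features.
    Region features never depend on the query, so they can be precomputed.
    """
    scores = {name: global_similarity(regs, words) for name, regs in images.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because the two modalities interact only through the similarity matrix, image features can be computed offline and reused for every query, which is the practical benefit of keeping the two pipelines separate.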
Related papers
- Towards Efficient Cross-modal Visual Textual Retrieval Using Transformer-encoder Deep Features (2021) 6.34
- Transformer Reasoning Network For Image-text Matching And Retrieval (2020) 16.15
- Unifying Two-stream Encoders With Transformers For Cross-modal Retrieval (2023) 13.89
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020) 0.00
- VITR: Augmenting Vision Transformers With Relation-focused Learning For Cross-modal Information Retrieval (2023) 4.52
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021) 10.97