Transformer Reasoning Network For Image-text Matching And Retrieval
2020 · Nicola Messina, Fabrizio Falchi, Andrea Esuli, et al.
Abstract
Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common
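The indexing argument in the abstract can be illustrated with a small sketch. If image and text features are computed independently, image embeddings can be stored once, offline, and a text query only needs a similarity search at retrieval time. The code below is not TERN: the vectors are hand-made toy values standing in for encoder outputs, and the names `build_index` and `search` are hypothetical helpers introduced only for this illustration. Cosine similarity is assumed as the matching score.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(image_embeddings):
    """Offline step: normalize and store each image embedding once.

    This works only because the image features do not depend on the query
    text (hypothetical helper, not part of any library).
    """
    return {image_id: l2_normalize(v) for image_id, v in image_embeddings.items()}


def search(index, text_embedding, top_k=2):
    """Online step: rank the stored images against one text embedding."""
    query = l2_normalize(text_embedding)
    scored = [
        (sum(q * v for q, v in zip(query, vec)), image_id)
        for image_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest cosine similarity first
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Toy embeddings (assumed values, standing in for real encoder outputs).
image_embeddings = {
    "img_dog": [0.9, 0.1, 0.0],
    "img_cat": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}
index = build_index(image_embeddings)

# A text embedding that lies close to the "dog" direction.
results = search(index, [0.8, 0.2, 0.1])
print(results)
```

With a mutual-attention model, by contrast, the score for each image depends on the query text, so every image-text pair would have to be passed through the network at query time and no such precomputed index could be built.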
Related papers
- Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders (2020) · 19.48
- Matching Images And Text With Multi-modal Tensor Fusion And Re-ranking (2019) · 19.77
- Towards Efficient Cross-modal Visual Textual Retrieval Using Transformer-encoder Deep Features (2021) · 6.34
- ALADIN: Distilling Fine-grained Alignment Scores For Efficient Image-text Matching And Retrieval (2022) · 14.00
- Visual Semantic Reasoning For Image-text Matching (2019) · 25.23
- Cross-modal Implicit Relation Reasoning And Aligning For Text-to-image Person Retrieval (2023) · 18.15
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) · 15.16
- Recurrence Meets Transformers For Universal Multimodal Retrieval (2025) · 2.41