Unifying Two-stream Encoders With Transformers For Cross-modal Retrieval
2023 · Yi Bin, Haoxuan Li, Yahui Xu, et al.
Abstract
Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such discrepancy in architectures may induce different semantic distribution spaces and limit the interactions between images and texts, and further result in inferior alignment between images and texts. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders could produce representations with more similar characteristics for images and texts, and make the interactions and alignments betwee
Related papers
- Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders (2020) 19.48
- Towards Efficient Cross-modal Visual Textual Retrieval Using Transformer-encoder Deep Features (2021) 6.34
- HiT: Hierarchical Transformer With Momentum Contrast For Video-text Retrieval (2021) 15.98
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021) 10.97
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025) 0.00