Unifying Two-stream Encoders With Transformers For Cross-modal Retrieval

Abstract

Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., a CNN for images and an RNN/Transformer for texts. Such architectural discrepancy may induce different semantic distribution spaces, limit the interactions between images and texts, and thus result in inferior alignment between the two modalities. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders can produce representations with more similar characteristics for images and texts, and make the interactions and alignments betwee
