Abstract

Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., a CNN for images and an RNN or Transformer for texts. This architectural discrepancy may induce different semantic distribution spaces, limit the interactions between images and texts, and ultimately result in inferior alignment between the two modalities. To address this gap, and inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework based purely on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders can produce representations with more similar characteristics for images and texts, and make the interactions and alignments betwee

Authors

(none)

Tags

  • Cross-Modal Hashing
  • Image Retrieval

Stats

  • citations: 28
  • S2 citations: n/a
  • github stars: 28
  • HF likes: 0
  • heat score: 13.89
  • arxiv key: bin2023unifying

Related papers