Abstract

Our objective is language-based search of large-scale image and video datasets. For this task, independently mapping text and vision to a joint embedding space (the dual-encoder approach) is attractive because retrieval scales efficiently to billions of images using approximate nearest neighbour search. An alternative approach, using vision-text transformers with cross-attention, gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but
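
The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes a retrieve-then-re-rank pipeline, which is one common way to pair a fast and a slow model and is not necessarily the authors' exact method. The names fast_retrieve, rerank, and slow_score are hypothetical, the embeddings are toy vectors, and the brute-force scan stands in for the approximate nearest neighbour index a real system would use.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fast_retrieve(query_emb, item_embs, k):
    """Dual-encoder stage: one dot product per item, keep the k best indices.

    Item embeddings are computed independently of the query, so they can be
    precomputed and indexed. This exhaustive scan is a stand-in (assumption)
    for an approximate nearest neighbour search.
    """
    scored = [(dot(query_emb, emb), idx) for idx, emb in enumerate(item_embs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


def rerank(query, candidates, slow_score):
    """Slow stage: re-score only the shortlisted candidates.

    slow_score(query, idx) is a hypothetical placeholder for an expensive
    cross-attention model that must be run once per (query, item) pair.
    """
    return sorted(candidates, key=lambda idx: slow_score(query, idx), reverse=True)
```

Under these assumptions the expensive scorer runs k times per query instead of once per item in the collection, which is what keeps the pipeline usable at scale.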

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations: 104
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 15.16
  • arxiv key: miech2021thinking

Related papers

Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers - learning-to-hash