Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers
2021 · Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, et al.
Abstract
Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but
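The trade-off described in the abstract can be illustrated with a small two-stage sketch: a cheap dual-encoder stage scores every item by a dot product over precomputed embeddings, and an expensive scorer is applied only to the resulting shortlist. The abstract as captured here is cut off before it explains how the Fast and Slow models are combined, so the re-ranking step below is an assumption for illustration, not the authors' stated method. The names `fast_top_k`, `retrieve_then_rerank`, and `slow_score` are hypothetical, and the brute-force scan stands in for the approximate nearest neighbour search a real system would use.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fast_top_k(query_emb: Vector, item_embs: List[Vector], k: int) -> List[int]:
    """Fast stage: rank all items by dot product with the query embedding.

    Item embeddings are computed independently of the query (dual encoder),
    so they can be precomputed once. This exhaustive scan is a stand-in for
    approximate nearest neighbour search.
    """
    scored = [(dot(query_emb, emb), idx) for idx, emb in enumerate(item_embs)]
    # Highest score first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def retrieve_then_rerank(
    query_emb: Vector,
    item_embs: List[Vector],
    slow_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Slow stage (assumed): re-order only the k shortlisted items.

    `slow_score` is a hypothetical placeholder for a costly query-item
    scorer such as a cross-attention model. It is called k times rather
    than once per item in the collection.
    """
    shortlist = fast_top_k(query_emb, item_embs, k)
    return sorted(shortlist, key=lambda idx: -slow_score(idx))


if __name__ == "__main__":
    embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
    toy_slow = {0: 0.2, 1: 0.8, 2: 0.99, 3: 0.1}
    print(retrieve_then_rerank([1.0, 0.0], embs, lambda i: toy_slow[i], k=2))
```

With this structure the expensive scorer's cost depends on the shortlist size `k` and not on the size of the collection, which is the scalability property the abstract is concerned with.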
Related papers
- Towards Efficient Cross-modal Visual Textual Retrieval Using Transformer-encoder Deep Features (2021)
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021)
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021)
- Training Vision Transformers For Image Retrieval (2021)
- Boosting Vision Transformers For Image Retrieval (2022)
- Dual Encoding For Video Retrieval By Text (2020)
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021)
- Everything At Once: Multi-modal Fusion Transformer For Video Retrieval (2021)