Towards Efficient Cross-modal Visual Textual Retrieval Using Transformer-encoder Deep Features
2021 · Nicola Messina, Giuseppe Amato, Fabrizio Falchi, et al.
Abstract
Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). The computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. These works evaluate matching performance on the retrieval task by performing sequential scans of the whole dataset, an approach that does not scale well as the number of images or captions grows. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective
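To make the scalability argument concrete, here is a minimal Python sketch contrasting a sequential scan with a lookup over sparsified features. The abstract as shown here does not say which sparsification scheme or index structure the authors use, so the top-k magnitude pruning, the plain inverted index, and every name below (`sparsify`, `build_inverted_index`, `index_search`) are illustrative assumptions, and the feature values are made up.

```python
from collections import defaultdict
import heapq


def sparsify(vec, k):
    """Keep the k largest-magnitude components; return {dim: value}.

    Assumption: top-k pruning stands in for whatever sparsification
    the paper actually applies.
    """
    top = heapq.nlargest(k, range(len(vec)), key=lambda i: abs(vec[i]))
    return {i: vec[i] for i in top}


def sequential_scan(query, dataset):
    """Baseline: score every item with a dense dot product, O(N * D)."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in dataset]
    return max(range(len(dataset)), key=scores.__getitem__)


def build_inverted_index(dataset, k):
    """Map each dimension to the (item_id, value) pairs that kept it."""
    index = defaultdict(list)
    for item_id, vec in enumerate(dataset):
        for dim, value in sparsify(vec, k).items():
            index[dim].append((item_id, value))
    return index


def index_search(query, index, k):
    """Score only items sharing a surviving dimension with the query."""
    scores = defaultdict(float)
    for dim, q_value in sparsify(query, k).items():
        for item_id, value in index.get(dim, ()):
            scores[item_id] += q_value * value
    return max(scores, key=scores.get) if scores else None


# Toy 4-dimensional "image" features and one "sentence" query (made-up numbers).
images = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.6, 0.0],
    [0.1, 0.0, 0.2, 0.9],
]
sentence = [0.0, 0.7, 0.7, 0.1]

index = build_inverted_index(images, k=2)
best_dense = sequential_scan(sentence, images)
best_sparse = index_search(sentence, index, k=2)
```

On this toy data both searches select the image at position 1, but the indexed search only touches the posting lists of the two dimensions that survive in the query, whereas the scan visits every component of every image. In general, pruning can change the ranking, so the speed-up trades against retrieval quality.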
Related papers
- Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders (2020) 19.48
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021) 10.97
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020) 0.00
- Is Cross-modal Information Retrieval Possible Without Training? (2023) 0.00
- Unifying Two-stream Encoders With Transformers For Cross-modal Retrieval (2023) 13.89