VITR: Augmenting Vision Transformers With Relation-focused Learning For Cross-modal Information Retrieval
2023 · Yan Gong, Georgina Cosma, Axel Finke
Abstract
The relations expressed in user queries are vital for cross-modal information retrieval. Relation-focused cross-modal retrieval aims to retrieve information that corresponds to these relations, enabling effective retrieval across different modalities. Pre-trained networks, such as Contrastive Language-Image Pre-training (CLIP), have gained significant attention for their strong performance in various cross-modal learning tasks. However, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a local encoder. VITR comprises two key components. First, it extends the capabilities of ViT-based cross-modal networks by e
Related papers
- Boosting Vision Transformers For Image Retrieval (2022) 15.28
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21
- Decomposing And Interpreting Image Representations Via Text In ViTs Beyond CLIP (2024) 7.28
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- Training Vision Transformers For Image Retrieval (2021) 0.00
- Fine-grained Visual Textual Alignment For Cross-modal Retrieval Using Transformer Encoders (2020) 19.48
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021) 9.23
- Self-supervised Vision Transformers For Writer Retrieval (2024) 5.24