CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language
2023 · Shentong Mo, Jingfei Xia, Ihor Markevych
Abstract
Visual and linguistic pre-training aims to learn vision and language representations jointly so that they can be transferred to visual-linguistic downstream tasks. However, semantic confusion between language and vision arises during the pre-training stage. Moreover, current pre-trained models tend to require substantial computational resources for fine-tuning on downstream tasks. In this work, we present CAVL, a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language. Specifically, we introduce a pair-wise contrastive loss to learn alignments between the whole sentence and each image in the same batch during pre-training. At the fine-tuning stage, we introduce two lightweight adaptation networks that reduce the number of model parameters and increase training speed, saving computational resources. We evaluate CAVL on six main downstream tasks, including Visual Question Answering (VQA), Visual Commonsense Reasoning (VC
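The abstract gives no formula for the pair-wise contrastive loss, so the following is only a minimal sketch of a generic in-batch, InfoNCE-style objective of the kind it describes. It is not the authors' formulation. The function names, the symmetric text-to-image and image-to-text averaging, and the `temperature` value of 0.07 are all assumptions made for illustration. Each sentence embedding is scored against every image embedding in the batch, and the matching pair (same index) is treated as the positive.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def in_batch_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Hypothetical symmetric in-batch contrastive loss (illustrative only).

    text_emb[k] and image_emb[k] are assumed to form the k-th matching
    sentence-image pair; all other pairings in the batch act as negatives.
    """
    texts = [l2_normalize(t) for t in text_emb]
    images = [l2_normalize(i) for i in image_emb]
    n = len(texts)

    # sim[r][c] = scaled cosine similarity between sentence r and image c.
    sim = [[dot(t, i) / temperature for i in images] for t in texts]

    # Text-to-image: each row should put its mass on the diagonal entry.
    t2i = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Image-to-text: same idea applied to the columns.
    i2t = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (t2i + i2t)


# Toy batch of two sentence-image pairs with 2-dimensional embeddings.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, the correctly paired embeddings give a loss near zero, while the swapped pairing gives a much larger one. That gap is the signal such an objective uses to pull matching sentences and images together.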
Related papers
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- COSMOS: Cross-modality Self-distillation For Vision Language Pre-training (2024) 10.02
- ViLBERT: Pretraining Task-agnostic Visiolinguistic Representations For Vision-and-language Tasks (2019) 0.00
- 12-in-1: Multi-task Vision And Language Representation Learning (2019) 17.85
- Contrastive Vision-language Learning With Paraphrasing And Negation (2025) 0.00
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) 0.00
- Multimodal Contrastive Training For Visual Representation Learning (2021) 16.32
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) 1.69