COSMOS: Cross-modality Self-distillation For Vision Language Pre-training
2024 · Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, et al.
Abstract
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retriev
Related papers
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) 12.33
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- Cosmoclip: Generalizing Large Vision-language Models For Astronomical Imaging (2024) 0.00
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) 0.00
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) 12.11
- SILC: Improving Vision Language Pretraining With Self-distillation (2023) 10.21
- Multimodal Contrastive Training For Visual Representation Learning (2021) 16.32