Semantic Compositions Enhance Vision-language Contrastive Learning
2024 · Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi
Abstract
In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification an
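The two ideas in the abstract, composing a new pair from two samples and training with within-batch negatives, can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation. The function names `compose_pair` and `clip_loss`, the left-half/right-half split, the " and " caption joiner, and the temperature of 0.07 are all assumptions: the abstract only states that the captions are fused and that 50% of each image is used.

```python
import numpy as np


def compose_pair(img_a, cap_a, img_b, cap_b, joiner=" and "):
    """Build one composite image-caption pair from two samples.

    Assumption: both images are (H, W, C) arrays of the same shape. We keep
    the left half of img_a and the right half of img_b; which regions the
    paper actually uses is not stated in the abstract.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("images must share a shape")
    half = img_a.shape[1] // 2
    image = np.concatenate([img_a[:, :half], img_b[:, half:]], axis=1)
    caption = cap_a + joiner + cap_b  # hypothetical fusion rule
    return image, caption


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb form the positive pair;
    every other row in the batch serves as a negative.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature

    def diag_cross_entropy(scores):
        # Numerically stable log-softmax, then pick the matched (diagonal) entries.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

In a training loop, a composite produced by `compose_pair` would simply take the place of an ordinary image-caption pair in the batch before encoding, which is consistent with the abstract's claim that the method adds no parameters.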
Related papers
- Learning Visual Composition Through Improved Semantic Guidance (2024) · 0.00
- Advancing Compositional Awareness In CLIP With Efficient Fine-tuning (2025) · 0.00
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024) · 4.52
- Composed Image Retrieval Using Contrastive Learning And Task-oriented CLIP-based Features (2023) · 16.84
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) · 12.11
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) · 0.00
- Compositional Image-text Matching And Retrieval By Grounding Entities (2025) · 0.60
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022) · 26.02