Advancing Compositional Awareness In CLIP With Efficient Fine-tuning
2025 · Amit Peleg, Naman Deep Singh, Matthias Hein
Abstract
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning, the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, in terms of both lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CL
Related papers
- Semantic Compositions Enhance Vision-language Contrastive Learning (2024) · 0.00
- Learning Visual Composition Through Improved Semantic Guidance (2024) · 0.00
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024) · 4.52
- Composed Image Retrieval Using Contrastive Learning And Task-oriented CLIP-based Features (2023) · 16.84
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) · 12.11
- Adding Simple Structure At Inference Improves Vision-language Compositionality (2025) · 0.00
- Prompting Large Vision-language Models For Compositional Reasoning (2024) · 0.00
- Compositional Image-text Matching And Retrieval By Grounding Entities (2025) · 0.60