Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding
2023 · Le Zhang, Rabiul Awal, Aishwarya Agrawal
Abstract
Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable impro
Related papers
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024)
- Semantic Compositions Enhance Vision-language Contrastive Learning (2024)
- Contrastive Vision-language Learning With Paraphrasing And Negation (2025)
- Learning Visual Composition Through Improved Semantic Guidance (2024)
- Prompting Large Vision-language Models For Compositional Reasoning (2024)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- BiVLC: Extending Vision-language Compositionality Evaluation With Text-to-image Retrieval (2024)
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025)