Contrastive Vision-language Learning With Paraphrasing And Negation
2025 · Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, et al.
Abstract
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance, because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to the training of CLIP-like models. The approach, called SemCLIP, is sho
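To make the ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the SemCLIP loss itself, which the abstract does not spell out. It shows the standard symmetric contrastive (InfoNCE) objective used by CLIP-style models, plus a hypothetical hinge-style term over an (original, paraphrase, negation) caption triple. The function names, the margin, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch where image k matches text k."""
    n = len(image_embs)
    # logits[i][j] is the scaled similarity of image i and text j.
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal (matching) entry.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / n

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))


def triple_penalty(orig, para, neg, margin=0.2):
    """Hypothetical hinge term (an assumption, not the paper's formula).

    It is zero once the paraphrase is at least `margin` more similar to the
    original caption than the negated caption is.
    """
    return max(0.0, margin - cosine(orig, para) + cosine(orig, neg))
```

In a sketch like this, the two terms would be summed with a weighting factor during training; how SemCLIP actually combines paraphrase and negation signals is not stated in the text above.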
Related papers
- Fine-tuning CLIP Text Encoders With Two-step Paraphrasing (2024) 2.26
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) 12.11
- The Effect Of Negation On CLIP In Medical Imaging: Limitations Of Contrastive Language-image Pretraining (2025) 0.00
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) 0.00
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) 0.00
- TNG-CLIP: Training-time Negation Data Generation For Negation Awareness Of CLIP (2025) 0.00
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024) 4.52
- Adapting Dual-encoder Vision-language Models For Paraphrased Retrieval (2024) 0.00