Finetuning CLIP To Reason About Pairwise Differences
2024 · Dylan Sam, Devin Willmott, Joao D. Semedo, et al.
Abstract
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that text descriptions of differences between images correspond to their difference in image embedding space, using synthetically generated data with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants ar
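The objective described in the abstract, aligning a text description of a difference with the difference of two image embeddings, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration rather than the authors' implementation: the symmetric InfoNCE-style form of the loss, the temperature of 0.07, the normalization of the difference vectors, and the function name `difference_contrastive_loss`. Plain Python lists stand in for the embeddings that CLIP's image and text encoders would produce.

```python
import math


def normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def logsumexp(values):
    # Numerically stable log(sum(exp(values))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def difference_contrastive_loss(img_a, img_b, diff_text, temperature=0.07):
    """Hypothetical symmetric contrastive loss over difference embeddings.

    img_a[i], img_b[i]: image embeddings of the i-th image pair.
    diff_text[i]: text embedding of a caption describing how image a differs from image b.
    """
    # Image-side difference vectors, one per pair, normalized to unit length.
    diffs = [normalize([x - y for x, y in zip(a, b)]) for a, b in zip(img_a, img_b)]
    texts = [normalize(t) for t in diff_text]

    # logits[i][j]: scaled cosine similarity between difference i and description j.
    logits = [[dot(d, t) / temperature for t in texts] for d in diffs]

    n = len(diffs)
    total = 0.0
    for i in range(n):
        row = logits[i]                         # difference i against all descriptions
        col = [logits[j][i] for j in range(n)]  # description i against all differences
        total += logsumexp(row) - row[i]
        total += logsumexp(col) - col[i]
    return total / (2 * n)


# Toy batch of two pairs: the matching descriptions should give a lower loss.
img_a = [[1.0, 0.0], [0.0, 1.0]]
img_b = [[0.0, 0.0], [0.0, 0.0]]
aligned = difference_contrastive_loss(img_a, img_b, [[1.0, 0.0], [0.0, 1.0]])
swapped = difference_contrastive_loss(img_a, img_b, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each description points in the same direction as its image difference, and large when the descriptions are swapped, which is the behavior a contrastive objective of this kind is meant to reward.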
Related papers
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap (2023)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- Fine-tuning CLIP Text Encoders With Two-step Paraphrasing (2024)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)