CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions
2024 · Yanqing Liu, Xianhang Li, Zeyu Wang, et al.
Abstract
Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining methods like CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, we observe a strong inverse effect in learning with synthetic captions: short synthetic captions generally lead to much higher performance than full-length ones. We therefore feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner to mimic the recaptioning process: conditioned on the paired image input and web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance in cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreov
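The first design, feeding only part of a synthetic caption to the text encoder, can be illustrated with a minimal sketch. The abstract does not say how the partial caption is selected, so the strategy here (a random contiguous run of sentences) is an assumption, and the names `split_sentences` and `sample_partial_caption` are hypothetical helpers, not the authors' code.

```python
import random
from typing import List, Optional


def split_sentences(caption: str) -> List[str]:
    """Naively split a caption into sentences on periods (illustrative only)."""
    parts = [s.strip() for s in caption.split(".")]
    return [s + "." for s in parts if s]


def sample_partial_caption(
    caption: str,
    max_sentences: int = 1,
    rng: Optional[random.Random] = None,
) -> str:
    """Return a random contiguous run of at most `max_sentences` sentences.

    Assumption: the selection rule is a placeholder; the abstract only
    states that partial synthetic captions are fed to the text encoder.
    """
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    if not sentences:
        return caption.strip()
    n = min(max_sentences, len(sentences))
    # Pick a start index so that the run of n sentences fits in the caption.
    start = rng.randrange(len(sentences) - n + 1)
    return " ".join(sentences[start:start + n])


if __name__ == "__main__":
    full = "A dog runs on a beach. The sky is overcast. Waves break in the background."
    print(sample_partial_caption(full, max_sentences=1, rng=random.Random(0)))
```

In a training loop, the sampled partial caption would go to the text encoder for the contrastive loss, while the full-length caption would remain the prediction target for the captioner described in the second design.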
Related papers
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024) 4.52
- Linear Alignment Of Vision-language Models For Image Captioning (2023) 0.00
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap (2023) 0.00
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026) 0.00
- Fine-tuning CLIP Text Encoders With Two-step Paraphrasing (2024) 2.26
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) 12.33