Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training
2024 Β· Haicheng Wang, Chen Ju, Weixiong Lin, et al.
Abstract
In rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming foundation for various downstream tasks. However, relying on one-to-one (image, text) contrastive paradigm to learn alignment from large-scale messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into one novel holistic paradigm, by updating both diverse data and alignment optimization. To obtain colorful data with low cost, we use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into multi-branch, and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, divers
Authors
(none)
Tags
Stats
Related papers
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024)0.00
- Lightclip: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)0.00
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)5.42
- Finetuning CLIP To Reason About Pairwise Differences (2024)0.00
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022)12.33
- Himo-clip: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment (2025)3.01
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)3.64
- Contrastive Vision-language Learning With Paraphrasing And Negation (2025)0.00