CLIP-PING: Boosting Lightweight Vision-language Models With Proximus Intrinsic Neighbors Guidance
2024 · Chu Myaet Thwal, Ye Lin Tun, Minh N. H. Nguyen, et al.
Abstract
Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a novel yet simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision fro
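A minimal sketch of the nearest-neighbor (NN) lookup mentioned in the abstract, assuming features from a frozen pre-trained encoder are stored in a bank and compared by cosine similarity. The function names, the `exclude` parameter, and the toy feature values are hypothetical illustrations, not taken from the paper; the cross nearest-neighbor (XNN) variant is not sketched because this excerpt does not define it.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def nearest_neighbor(query, bank, exclude=None):
    """Return the index of the bank vector most cosine-similar to `query`.

    `exclude` (hypothetical parameter) skips one index, e.g. the query's own
    position when the query itself is stored in the bank.
    """
    q = l2_normalize(query)
    best_idx, best_sim = -1, -math.inf
    for i, feat in enumerate(bank):
        if i == exclude:
            continue
        f = l2_normalize(feat)
        sim = sum(a * b for a, b in zip(q, f))
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx


# Toy stand-ins for frozen-encoder image features (made-up numbers).
image_bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Nearest neighbor of sample 0, excluding sample 0 itself.
nn_of_first = nearest_neighbor(image_bank[0], image_bank, exclude=0)
```

In a training setup along these lines, the retrieved neighbor would presumably serve as an additional positive for a contrastive loss; how CLIP-PING actually combines these terms is not stated in this excerpt.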
Related papers
- Lightclip: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025) 0.00
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) 0.00
- Clip-lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) 0.00
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- Superclip: CLIP With Simple Classification Supervision (2025) 0.00