CLIP-KD: An Empirical Study Of CLIP Model Distillation
2023 · Chuanguang Yang, Zhulin An, Libo Huang, et al.
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP
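The two strategies the abstract singles out, feature mimicry with a Mean Squared Error loss and contrastive learning across teacher and student encoders, can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the temperature value of 0.07, and the assumption that teacher and student embeddings already share a dimension are all hypothetical choices made for illustration, and only one contrastive direction (student image against teacher text) is shown.

```python
import math


def mse_feature_loss(student, teacher):
    """Mean squared error between paired student and teacher embeddings.

    Both arguments are lists of equal-length vectors (assumed same dimension).
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_contrastive_loss(student_img, teacher_txt, temperature=0.07):
    """InfoNCE-style loss pairing student image i with teacher text i.

    The temperature is an assumed value, not one taken from the paper.
    """
    s = [l2_normalize(v) for v in student_img]
    t = [l2_normalize(v) for v in teacher_txt]
    loss = 0.0
    for i, s_vec in enumerate(s):
        logits = [sum(a * b for a, b in zip(s_vec, t_vec)) / temperature
                  for t_vec in t]
        # Numerically stable log-sum-exp over all teacher texts in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(s)
```

Both losses fall as the student embeddings move toward the teacher's, which matches the abstract's explanation that the gains come from maximizing teacher-student feature similarity.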
Related papers
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) · 0.00
- ConaCLIP: Exploring Distillation Of Fully-connected Knowledge Interaction Graph For Lightweight Text-image Retrieval (2023) · 4.52
- AMMKD: Adaptive Multimodal Multi-teacher Distillation For Lightweight Vision-language Models (2025) · 0.00
- CLIP-MoE: Towards Building Mixture Of Experts For CLIP With Diversified Multiplet Upcycling (2024) · 2.26
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) · 12.33
- SILC: Improving Vision Language Pretraining With Self-distillation (2023) · 10.21
- Koo-fu CLIP: Closed-form Adaptation Of Vision-language Models Via Fukunaga-koontz Linear Discriminant Analysis (2026) · 0.00
- RECLIP: Resource-efficient CLIP By Training With Small Images (2023) · 0.00