Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation
2025 · Daniel Csizmadia, Andrei Codreanu, Victor Sim, et al.
Abstract
We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions, just a fraction of CL
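The hybrid loss mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' formulation: the weighting factor `alpha`, the `temperature` value, the choice of a symmetric InfoNCE term, and the decision to align both image and text student embeddings with the teacher. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs.
    The temperature of 0.07 is an assumed value."""
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    i2t = [[dot(a, b) / temperature for b in txt] for a in img]
    t2i = [list(col) for col in zip(*i2t)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diagonal(i2t) + cross_entropy_diagonal(t2i))

def cosine_distill_loss(student_emb, teacher_emb):
    """Mean (1 - cosine similarity) between student and teacher embeddings."""
    sims = [dot(normalize(s), normalize(t))
            for s, t in zip(student_emb, teacher_emb)]
    return 1.0 - sum(sims) / len(sims)

def hybrid_loss(student_img, student_txt, teacher_img, teacher_txt,
                alpha=0.5, temperature=0.07):
    """Weighted sum of the contrastive and cosine terms (alpha is assumed)."""
    contrastive = contrastive_loss(student_img, student_txt, temperature)
    distill = 0.5 * (cosine_distill_loss(student_img, teacher_img)
                     + cosine_distill_loss(student_txt, teacher_txt))
    return alpha * contrastive + (1.0 - alpha) * distill
```

In this sketch the cosine term vanishes when the student reproduces the teacher's embeddings up to scale, so only the contrastive term remains; how the paper actually balances the two objectives is not stated in the abstract.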
Related papers
- CLIP-KD: An Empirical Study Of CLIP Model Distillation (2023) · 17.57
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022) · 12.33
- CLIP-MoE: Towards Building Mixture Of Experts For CLIP With Diversified Multiplet Upcycling (2024) · 2.26
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) · 2.26
- ConaCLIP: Exploring Distillation Of Fully-connected Knowledge Interaction Graph For Lightweight Text-image Retrieval (2023) · 4.52
- Dynamic Contrastive Distillation For Image-text Retrieval (2022) · 11.76
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024) · 6.34
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023) · 18.12