ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval
2023 · Jiapeng Wang, Chengyu Wang, Xiaodan Wang, et al.
Abstract
Large-scale pre-trained text-image models with dual-encoder architectures (such as CLIP) are typically adopted for various vision-language applications, including text-image retrieval. However, these models remain impractical on edge devices and in real-time settings because of their substantial indexing and inference time and their heavy consumption of computational resources. Although knowledge distillation techniques have been widely used for uni-modal model compression, how to extend them to the setting where the numbers of modalities and of teachers/students are both doubled has rarely been studied. In this paper, we conduct comprehensive experiments on this topic and propose the fully-Connected knowledge interaction graph (Cona) technique for cross-modal pre-training distillation. Based on our findings, the resulting ConaCLIP achieves SOTA performance on the widely used Flickr30K and MSCOCO benchmarks under the lightweight setting. An industry application of our method on an e-comm
Related papers
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) 0.00
- CLIP-KD: An Empirical Study Of CLIP Model Distillation (2023) 17.57
- AMMKD: Adaptive Multimodal Multi-teacher Distillation For Lightweight Vision-language Models (2025) 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- Dynamic Contrastive Distillation For Image-text Retrieval (2022) 11.76
- MCAD: Multi-teacher Cross-modal Alignment Distillation For Efficient Image-text Retrieval (2023) 3.58
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023) 18.12
- Leaner And Faster: Two-stage Model Compression For Lightweight Text-image Retrieval (2022) 6.34