Dynamic Contrastive Distillation For Image-text Retrieval
2022 · Jun Rao, Liang Ding, Shuhan Qi, et al.
Abstract
Although vision-and-language pretraining (VLP) has driven remarkable progress in cross-modal image-text retrieval (ITR) over the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face two challenges: 1) typical uni-modal metric learning approaches are difficult to apply directly to cross-modal tasks, because limited GPU memory cannot accommodate optimizing many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network on different hard samples, which have different effects on distillation learning and student network optimization. We try to overcome these challenges from two points. First, to achieve m
Related papers
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) 0.00
- MCAD: Multi-teacher Cross-modal Alignment Distillation For Efficient Image-text Retrieval (2023) 3.58
- CODER: Coupled Diversity-sensitive Momentum Contrastive Learning For Image-text Retrieval (2022) 13.72
- LightningDOT: Pre-training Visual-semantic Embeddings For Real-time Image-text Retrieval (2021) 17.42
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023) 10.85
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025) 0.00
- CLIP-KD: An Empirical Study Of CLIP Model Distillation (2023) 17.57
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021) 15.43