MCAD: Multi-teacher Cross-modal Alignment Distillation For Efficient Image-text Retrieval
2023 Β· Youbo Lei, Feifei He, Chen Chen, et al.
Abstract
Due to the success of large-scale visual-language pretraining (VLP) models and the widespread use of image-text retrieval in industry areas, it is now critically necessary to reduce the model size and streamline their mobile-device deployment. Single- and dual-stream model structures are commonly used in image-text retrieval with the goal of closing the semantic gap between textual and visual modalities. While single-stream models use deep feature fusion to achieve more accurate cross-model alignment, dual-stream models are better at offline indexing and fast inference.We propose a Multi-teacher Cross-modality Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher similarity distributions and features. Then, we conduct both distribution and feature distillation to boost the capability of the studen
Authors
(none)
Tags
Stats
Related papers
- AMMKD: Adaptive Multimodal Multi-teacher Distillation For Lightweight Vision-language Models (2025)0.00
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021)15.43
- Dynamic Contrastive Distillation For Image-text Retrieval (2022)11.76
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)0.00
- C2KD: Cross-lingual Cross-modal Knowledge Distillation For Multilingual Text-video Retrieval (2022)8.94
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)13.60
- ALADIN: Distilling Fine-grained Alignment Scores For Efficient Image-text Matching And Retrieval (2022)14.00
- Conaclip: Exploring Distillation Of Fully-connected Knowledge Interaction Graph For Lightweight Text-image Retrieval (2023)4.52