AMMKD: Adaptive Multimodal Multi-teacher Distillation For Lightweight Vision-language Models
2025 · Yuqi Li, Chuanguang Yang, Junhao Dong, et al.
Abstract
The success of large-scale visual language pretraining (VLP) models has driven widespread adoption of image-text retrieval tasks. However, their deployment on mobile devices remains limited due to large model sizes and computational complexity. We propose Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD), a novel framework that integrates multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to deliver lightweight yet effective retrieval models. Specifically, our method begins with a feature fusion network that extracts and merges discriminative features from both the image and text modalities. To reduce model parameters and further improve performance, we design a multi-teacher knowledge distillation framework to pre-train two CLIP teacher models. We decouple modalities by pre-computing and storing text features as class vectors via the teacher text encoder to enhance efficiency. To better align teacher and student outputs, we apply KL scatt
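The distillation step described in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the text features are assumed to be cached once as class vectors in an embedding space shared by both teachers and the student, the alignment term is assumed to be a weighted sum of KL divergences from each teacher's distribution to the student's, and every name here (`similarity_logits`, `multi_teacher_kl`, the fixed `teacher_weights`) is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def similarity_logits(image_feature, class_vectors):
    """Dot-product score of one image feature against cached text vectors."""
    return [sum(a * b for a, b in zip(image_feature, c)) for c in class_vectors]


def multi_teacher_kl(student_logits, teacher_logits_list, teacher_weights,
                     temperature=1.0):
    """Weighted sum of KL(teacher || student) over all teachers (assumed form)."""
    student_probs = softmax(student_logits, temperature)
    loss = 0.0
    for weight, teacher_logits in zip(teacher_weights, teacher_logits_list):
        teacher_probs = softmax(teacher_logits, temperature)
        loss += weight * kl_divergence(teacher_probs, student_probs)
    return loss


# Text features computed once by a text encoder and stored (toy values).
class_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

# Toy image features from two teachers and one student for the same image.
teacher_a_image = [0.9, 0.1]
teacher_b_image = [0.8, 0.3]
student_image = [0.5, 0.5]

teacher_logits = [
    similarity_logits(teacher_a_image, class_vectors),
    similarity_logits(teacher_b_image, class_vectors),
]
student_logits = similarity_logits(student_image, class_vectors)

# Fixed, equal teacher weights; an adaptive scheme would update these.
loss = multi_teacher_kl(student_logits, teacher_logits, [0.5, 0.5])
print(f"distillation loss: {loss:.4f}")
```

Because the class vectors are stored ahead of time, only the image branch runs per sample in this sketch, which is the efficiency argument the abstract makes for decoupling the modalities.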
Related papers
- MCAD: Multi-teacher Cross-modal Alignment Distillation For Efficient Image-text Retrieval (2023) 3.58
- CLIP-KD: An Empirical Study Of CLIP Model Distillation (2023) 17.57
- ConaCLIP: Exploring Distillation Of Fully-connected Knowledge Interaction Graph For Lightweight Text-image Retrieval (2023) 4.52
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025) 0.00
- Let All Be Whitened: Multi-teacher Distillation For Efficient Visual Retrieval (2023) 8.86
- C2KD: Cross-lingual Cross-modal Knowledge Distillation For Multilingual Text-video Retrieval (2022) 8.94
- Dynamic Contrastive Distillation For Image-text Retrieval (2022) 11.76
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) 0.00