CovMatch: Cross-covariance Guided Multimodal Dataset Distillation With Trainable Text Encoder
2025 · Yongmin Lee, Hye Won Chung
Abstract
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outpe
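The abstract describes the objective only in words: match the cross-covariance of real and synthetic features, and regularize the feature distribution within each modality. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation. The squared Frobenius distance, the mean-matching regularizer, the 1/n normalization, and the names `cross_covariance`, `covmatch_style_loss`, and `reg_weight` are all hypothetical choices made for illustration; the paper's actual loss may differ.

```python
import numpy as np


def cross_covariance(img_feats, txt_feats):
    """Cross-covariance of paired features: (n, d_img) and (n, d_txt) -> (d_img, d_txt)."""
    img_centered = img_feats - img_feats.mean(axis=0, keepdims=True)
    txt_centered = txt_feats - txt_feats.mean(axis=0, keepdims=True)
    # Assumption: biased estimate (divide by n) so small synthetic sets stay well defined.
    return img_centered.T @ txt_centered / img_feats.shape[0]


def covmatch_style_loss(real_img, real_txt, syn_img, syn_txt, reg_weight=1.0):
    """Illustrative loss: cross-covariance gap plus per-modality mean matching.

    Assumption: both terms use squared Euclidean/Frobenius distance; the
    source abstract does not specify the exact form of either term.
    """
    cov_gap = cross_covariance(real_img, real_txt) - cross_covariance(syn_img, syn_txt)
    align_term = np.sum(cov_gap ** 2)

    # Hypothetical within-modality regularizer: match the feature means.
    img_mean_gap = real_img.mean(axis=0) - syn_img.mean(axis=0)
    txt_mean_gap = real_txt.mean(axis=0) - syn_txt.mean(axis=0)
    reg_term = np.sum(img_mean_gap ** 2) + np.sum(txt_mean_gap ** 2)

    return align_term + reg_weight * reg_term


# Usage with random stand-ins for encoder outputs (not real data).
rng = np.random.default_rng(0)
real_img = rng.normal(size=(256, 16))
real_txt = rng.normal(size=(256, 12))
syn_img = rng.normal(size=(10, 16))
syn_txt = rng.normal(size=(10, 12))
loss_value = covmatch_style_loss(real_img, real_txt, syn_img, syn_txt)
```

In a distillation loop, a loss of this kind would be minimized with respect to the synthetic image-text pairs, which calls for an autodiff framework rather than plain NumPy. The point here is only that the statistics being matched have a fixed size set by the feature dimensions, regardless of how many real pairs are used.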
Related papers
- MCAD: Multi-teacher Cross-modal Alignment Distillation For Efficient Image-text Retrieval (2023) 3.58
- Modest-align: Data-efficient Alignment For Vision-language Models (2025) 0.00
- Vision-language Dataset Distillation (2023) 0.00
- Multimodal Contrastive Training For Visual Representation Learning (2021) 16.32
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) 0.00
- Dynamic Contrastive Distillation For Image-text Retrieval (2022) 11.76
- C2KD: Cross-lingual Cross-modal Knowledge Distillation For Multilingual Text-video Retrieval (2022) 8.94
- The More, The Merrier: Contrastive Fusion For Higher-order Multimodal Alignment (2025) 0.00