CLIP-MoE: Towards Building Mixture Of Experts For CLIP With Diversified Multiplet Upcycling
2024 · Jihai Zhang, Xiaoye Qu, Tong Zhu, et al.
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies discovered that CLIP can only encode one aspect of the feature space, leading to substantial information loss and indistinctive features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set with highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically activates a subset of CLIP experts, achieving an effective balance between model capacity and computational c
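The abstract's phrase "dynamically activates a subset of CLIP experts" describes sparse expert routing. As a rough illustration only, the sketch below shows generic top-k gating over toy experts in plain Python. The function names (`top_k_route`, `moe_forward`), the toy experts, and the choice of k are hypothetical and are not taken from the paper, which may route differently.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Hypothetical helper: returns a list of (expert_index, weight) pairs.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts only.

    Experts that are not selected are never called, which is where the
    compute saving of a sparse mixture comes from.
    """
    routed = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, w in routed:
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (assumption): each one just scales its input vector.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]   # made-up router scores for one token

routed = top_k_route(logits, k=2)
output = moe_forward([1.0, -1.0], experts, logits, k=2)
```

In a real model the router scores would come from a learned layer applied to each token, and each expert would be a feed-forward block rather than a scalar multiply.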
Related papers
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- CLIP-KD: An Empirical Study Of CLIP Model Distillation (2023)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- EfficientCLIP: Efficient Cross-modal Pre-training By Ensemble Confident Learning And Language Modeling (2021)
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023)
- FLEX-CLIP: Feature-level Generation Network Enhanced CLIP For X-shot Cross-modal Retrieval (2024)