Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning
2024 · Can Yaras, Siyi Chen, Peng Wang, et al.
Abstract
Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of the modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are
Related papers
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)
- A Mathematical Perspective On Contrastive Learning (2025)
- Guiding Cross-modal Representations With MLLM Priors Via Preference Alignment (2025)
- Compressing Then Matching: An Efficient Pre-training Paradigm For Multimodal Embedding (2025)
- Multimodal Contrastive Training For Visual Representation Learning (2021)
- Generalized Contrastive Learning For Universal Multimodal Retrieval (2025)
- I0T: Embedding Standardization Method Towards Zero Modality Gap (2024)