CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization
2025 · Yingrui Ji, Xi Xiao, Gaofei Chen, et al.
Abstract
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective as an implicit Information Bottleneck optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing seman
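To make the objective described in the abstract concrete, here is a minimal sketch of the standard symmetric contrastive (InfoNCE) loss used to train CLIP-style models, with a slot for an added penalty term. The abstract does not give the form of the CIBR penalty, so `penalty` is passed in as an already-computed number. The names `regularized_loss` and `beta`, and the default weights, are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: the k-th image and k-th text form the positive pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    logits_t = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def regularized_loss(image_embs, text_embs, penalty, beta=0.1, temperature=0.07):
    """Contrastive loss plus a weighted penalty.

    ASSUMPTION: `penalty` stands in for a redundancy penalty whose exact form
    is not given in the abstract; `beta` is a hypothetical weighting coefficient.
    """
    return clip_contrastive_loss(image_embs, text_embs, temperature) + beta * penalty
```

In this sketch, correctly paired embeddings drive the contrastive term toward zero, while mismatched pairs make it large. The penalty enters only as an additive term scaled by `beta`.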
Related papers
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022)
- Cross-modal Retrieval Meets Inference: Improving Zero-shot Classification With Cross-modal Retrieval (2023)
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)
- I0T: Embedding Standardization Method Towards Zero Modality Gap (2024)
- \(\beta\)-CLIP: Text-conditioned Contrastive Learning For Multi-granular Vision-language Alignment (2025)
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- SuperCLIP: CLIP With Simple Classification Supervision (2025)