I0T: Embedding Standardization Method Towards Zero Modality Gap
2024 · Na Min An, Eunki Kim, James Thorne, et al.
Abstract
Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the issue of modality gap, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristic that each image/text encoder independently possesses and propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, \(\text{I0T}_{\text{post}}\), that reduces the modality gap approximately to zero and (2) a trainable method, \(\text{I0T}_{\text{async}}\), to alleviate the modality gap problem by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with t
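To make the idea of a post-hoc standardization concrete, here is a minimal sketch on synthetic data. It assumes that "standardization" means subtracting each modality's mean embedding and renormalizing to unit length, and that the modality gap is measured as the distance between the two modality centroids; the abstract does not spell out either detail, so the paper's actual procedure may differ. The function names (`l2_normalize`, `modality_gap`, `standardize`) and the random arrays standing in for encoder outputs are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def modality_gap(image_emb, text_emb):
    """Assumed gap metric: distance between the two modality centroids."""
    return float(np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0)))


def standardize(emb):
    """Assumed post-hoc step: remove the modality's mean vector, then renormalize."""
    return l2_normalize(emb - emb.mean(axis=0, keepdims=True))


rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for encoder outputs (not real CLIP embeddings):
# every sample in a modality shares one large offset, mimicking a
# modality-specific component that separates images from texts.
image_emb = l2_normalize(rng.normal(size=(200, dim)) + 3.0 * rng.normal(size=dim))
text_emb = l2_normalize(rng.normal(size=(200, dim)) + 3.0 * rng.normal(size=dim))

gap_before = modality_gap(image_emb, text_emb)
gap_after = modality_gap(standardize(image_emb), standardize(text_emb))
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

Because the step only needs embeddings that have already been computed, it can be applied to a frozen model without retraining, which is what "post-hoc" suggests here.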
Related papers
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025)
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)
- C-CLIP: Contrastive Image-text Encoders To Close The Descriptive-commentative Gap (2023)
- Cross-modal Retrieval Meets Inference: Improving Zero-shot Classification With Cross-modal Retrieval (2023)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026)
- Fill The Gap: Quantifying And Reducing The Modality Gap In Image-text Representation Learning (2025)