Fill The Gap: Quantifying And Reducing The Modality Gap In Image-text Representation Learning
2025 · François Role, Sébastien Meyer, Victor Amblard
Abstract
Vision-language models (VLMs) make it possible to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon, meaning that there is a clear separation between the embeddings of one modality and those of the other in the embedding space. While this misalignment is detrimental to downstream tasks such as multimodal retrieval, multimodal clustering, or zero-shot classification, no generic and practical methods have so far been proposed to assess it precisely, let alone reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.
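To make the notion of a modality gap concrete, the sketch below computes the distance between the centroids of L2-normalized image and text embeddings. This is a simple, commonly used proxy and is not one of the measures proposed in the paper. The function names (`l2_normalize`, `centroid_gap`) and the synthetic embeddings are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def centroid_gap(image_emb, text_emb):
    """Euclidean distance between the centroids of the two normalized modalities.

    A simple proxy for the modality gap (assumption: not the paper's measure).
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


# Synthetic stand-ins for encoder outputs (hypothetical data, not real VLM embeddings).
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 16))

# Push the two "modalities" apart along the first axis to mimic a gap.
offset = np.zeros(16)
offset[0] = 3.0
gap_separated = centroid_gap(shared + offset, shared - offset)

# Nearly identical clouds should show almost no gap.
noisy_copy = shared + 0.01 * rng.normal(size=shared.shape)
gap_aligned = centroid_gap(shared, noisy_copy)

print(f"separated: {gap_separated:.3f}, aligned: {gap_aligned:.3f}")
```

A larger value indicates that the two modalities occupy more clearly separated regions of the unit sphere. A centroid distance only captures a global shift, which is one reason finer-grained measures may be needed.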
Related papers
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024) · 0.00
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025) · 3.64
- I0T: Embedding Standardization Method Towards Zero Modality Gap (2024) · 0.00
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023) · 0.00
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) · 12.11
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025) · 0.00
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024) · 0.00
- Category-level Text-to-image Retrieval Improved: Bridging The Domain Gap With Diffusion Models And Vision Encoders (2025) · 1.20