Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion
2025 · Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, et al.
Abstract
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, w
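The claim that a CLIP-style inter-modal contrastive loss enforces no intra-modal constraints can be illustrated with a minimal NumPy sketch of a symmetric image-text contrastive loss. This is not the authors' code: the function names, the toy data, and the temperature value of 0.07 are assumptions chosen for illustration. The only similarities that enter the loss are image-to-text ones; no image-to-image or text-to-text term appears anywhere.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each feature vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    # Hypothetical helper: symmetric contrastive loss over N matched pairs.
    # temperature=0.07 is an assumed value, not taken from the abstract.
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    # (N, N) matrix of image-to-text cosine similarities. Note that neither
    # img @ img.T nor txt @ txt.T is ever computed or constrained.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data: each "image" feature is a noisy copy of its paired "text" feature.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
images = text + 0.1 * rng.normal(size=(8, 16))

aligned_loss = clip_style_loss(images, text)
# Shifting the text rows by one breaks every image-text pairing.
shuffled_loss = clip_style_loss(images, np.roll(text, 1, axis=0))
```

In this toy setup the loss is lower for correctly paired features than for mismatched ones, yet nothing in the objective depends on how images relate to other images, which is the gap the abstract refers to as intra-modal misalignment.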
Related papers
- Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026)
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)