Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment
2026 Β· Simone Magistri, Dipam Goswami, Marco Mistretta, et al.
Abstract
Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned,
Authors
(none)
Tags
Stats
Related papers
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)3.64
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)6.34
- Linear Alignment Of Vision-language Models For Image Captioning (2023)0.00
- Lightclip: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)0.00
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)5.42
- Finetuning CLIP To Reason About Pairwise Differences (2024)0.00
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)0.00
- Himo-clip: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment (2025)3.01