HiMo-CLIP: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment
2025 · Ruijia Wu, Ping Chen, Fei Shen, et al.
Abstract
Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a mono
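The in-batch PCA step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `in_batch_pca_components` and `cosine_logits`, the choice of `k`, and the use of NumPy's SVD are all assumptions made for illustration. The sketch projects each text embedding in a batch onto the top-`k` directions of largest in-batch variance, which gives one possible reading of a "latent semantic component" that depends on the batch.

```python
import numpy as np


def in_batch_pca_components(text_emb: np.ndarray, k: int) -> np.ndarray:
    """Project each text embedding onto the top-k principal directions of its batch.

    text_emb: (batch, dim) array of text embeddings (hypothetical input).
    Returns a (batch, dim) array: the batch mean plus each row's projection
    onto the k directions of largest in-batch variance.
    """
    mean = text_emb.mean(axis=0, keepdims=True)
    centered = text_emb - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]  # (k, dim)
    projected = centered @ basis.T @ basis
    return projected + mean


def cosine_logits(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between image and text embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return img @ txt.T


# Toy usage with random stand-ins for encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
image = rng.normal(size=(8, 16))

components = in_batch_pca_components(text, k=3)
logits_full = cosine_logits(image, text)             # image vs. full text
logits_component = cosine_logits(image, components)  # image vs. PCA component
```

Because the principal directions are recomputed per batch, the same caption can yield a different component depending on which other captions it is batched with, which is one way to read "batch-aware" in the abstract. How the component similarities enter the training loss is not specified in the text above, so the sketch stops at computing them.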
Related papers
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- ARGENT: Adaptive Hierarchical Image-text Representations (2026)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- \(\beta\)-CLIP: Text-conditioned Contrastive Learning For Multi-granular Vision-language Alignment (2025)