ARGENT: Adaptive Hierarchical Image-text Representations
2026 · Chuong Huynh, Hossein Souri, Abhinav Kumar, et al.
Abstract
Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics, which are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-base
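The cone-collapse mechanism described in the abstract can be illustrated numerically. The sketch below is not taken from this paper: it assumes the half-aperture form commonly used in earlier Lorentz-model entailment-cone losses, arcsin(2K / (sqrt(c) * ||x_space||)), where the boundary constant `k`, the curvature `c`, and the function name `half_aperture` are illustrative assumptions. The clamp to 1.0 is only there to keep arcsin defined. Under these assumptions the cone opens up as the parent's norm shrinks and saturates at 90 degrees, which is a half-space.

```python
import math


def half_aperture(space_norm: float, c: float = 1.0, k: float = 0.1) -> float:
    """Half-aperture (radians) of an entailment cone rooted at a parent embedding.

    Assumed form (from earlier hyperbolic entailment-cone work, not confirmed
    as this paper's definition): arcsin(2k / (sqrt(c) * space_norm)).
    `space_norm` is the norm of the embedding's space-like component.
    """
    if space_norm <= 0.0:
        raise ValueError("space_norm must be positive")
    ratio = 2.0 * k / (math.sqrt(c) * space_norm)
    # Clamp so arcsin stays defined; once the ratio reaches 1 the cone is a half-space.
    return math.asin(min(ratio, 1.0))


if __name__ == "__main__":
    # Shrinking the parent's norm widens its cone until it saturates.
    for norm in (2.0, 1.0, 0.5, 0.2, 0.1):
        degrees = math.degrees(half_aperture(norm))
        print(f"norm={norm:.2f}  half-aperture={degrees:.1f} deg")
```

Once the half-aperture saturates, almost any child embedding falls inside the parent's cone, so an entailment penalty defined on the angle outside the cone no longer separates true descendants from unrelated concepts.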
Related papers
- Learning Visual Hierarchies In Hyperbolic Space For Image Retrieval (2024) 0.00
- HiMo-CLIP: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment (2025) 3.01
- Hyperbolic Image-text Representations (2023) 4.61
- Hyperbolic Hierarchical Alignment Reasoning Network For Text-3D Retrieval (2025) 1.81
- Linear Spaces Of Meanings: Compositional Structures In Vision-language Models (2023) 9.41
- HierLoc: Hyperbolic Entity Embeddings For Hierarchical Visual Geolocation (2026) 0.00
- ProbVLM: Probabilistic Adapter For Frozen Vision-language Models (2023) 13.41
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) 12.11