Hyperbolic Image-text Representations
2023 · Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, et al.
Abstract
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru
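To make the geometric idea concrete, here is a minimal sketch of how Euclidean encoder outputs could be lifted into hyperbolic space and compared by geodesic distance. It assumes the Lorentz (hyperboloid) model with a fixed curvature parameter `c`; the abstract itself does not name the model, and the function names `lift_to_hyperboloid`, `lorentz_inner`, and `hyperbolic_distance` are hypothetical, not taken from the released code.

```python
import numpy as np

def lift_to_hyperboloid(v, c=1.0):
    """Exponential map at the origin: send Euclidean vectors v (..., d)
    to points (..., d + 1) on the hyperboloid of curvature -c.
    Assumption: Lorentz model, time coordinate stored first."""
    v = np.asarray(v, dtype=np.float64)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scaled = np.sqrt(c) * norm
    # sinh(x) / x tends to 1 as x -> 0, so guard the division near zero.
    factor = np.where(scaled > 1e-8, np.sinh(scaled) / np.maximum(scaled, 1e-8), 1.0)
    space = factor * v
    # The time coordinate follows from the constraint -t^2 + |space|^2 = -1/c.
    time = np.sqrt(1.0 / c + np.sum(space ** 2, axis=-1, keepdims=True))
    return np.concatenate([time, space], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x_0 * y_0 + sum_i x_i * y_i."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def hyperbolic_distance(x, y, c=1.0):
    """Geodesic distance between points on the hyperboloid of curvature -c."""
    # Clip to the domain of arccosh to absorb floating-point error.
    arg = np.clip(-c * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)

# Usage: a contrastive objective could use the negative distance as a similarity score.
image_embedding = lift_to_hyperboloid(np.array([0.3, -0.2, 0.5]))
text_embedding = lift_to_hyperboloid(np.array([0.1, -0.1, 0.2]))
similarity = -hyperbolic_distance(image_embedding, text_embedding)
```

Under this lifting, the distance from the origin to a lifted point equals the Euclidean norm of the original vector, so more generic concepts can sit near the origin and more specific ones farther out, which is one way a hierarchy can be encoded.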
Related papers
- ARGENT: Adaptive Hierarchical Image-text Representations (2026) · 0.00
- Learning Visual Hierarchies In Hyperbolic Space For Image Retrieval (2024) · 0.00
- Hyperbolic Hierarchical Alignment Reasoning Network For Text-3D Retrieval (2025) · 1.81
- HiMo-CLIP: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment (2025) · 3.01
- Using Text To Teach Image Retrieval (2020) · 5.24
- Hyperbolic Image Embeddings (2019) · 17.91
- Embedding Arithmetic Of Multimodal Queries For Image Retrieval (2021) · 9.03
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52