ImageBind: One Embedding Space To Bind Them All
2023 · Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, et al.
Abstract
We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate visio
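To make "cross-modal retrieval" and "composing modalities with arithmetic" concrete, here is a minimal sketch of how both might work once every modality lives in one shared embedding space. It is an illustration under stated assumptions, not the authors' implementation: the 3-dimensional vectors are hand-made toy values rather than outputs of any ImageBind encoder, and the helper names (`normalize`, `cosine`, `retrieve`, `compose`) are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def retrieve(query, gallery):
    """Return gallery keys ranked by cosine similarity to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)


def compose(a, b):
    """Add two unit-normalized embeddings and renormalize the sum."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])


# Toy embeddings (assumed values). The axes loosely stand for
# "bird", "beach", and "traffic" content.
image_gallery = {
    "bird_in_forest": [0.9, 0.1, 0.0],
    "empty_beach": [0.0, 1.0, 0.1],
    "bird_on_beach": [0.7, 0.7, 0.0],
    "city_street": [0.0, 0.1, 0.9],
}
audio_birdsong = [1.0, 0.0, 0.1]  # stand-in for an audio embedding
image_beach = [0.1, 1.0, 0.0]     # stand-in for an image embedding

# Cross-modal retrieval: an audio query ranks images directly.
audio_ranking = retrieve(audio_birdsong, image_gallery)

# Composition: audio + image embeddings form a single combined query.
composed_ranking = retrieve(compose(audio_birdsong, image_beach), image_gallery)

print(audio_ranking[0])
print(composed_ranking[0])
```

With these toy values, the audio query alone ranks the forest bird image first, while the composed query moves the image containing both a bird and a beach to the top. A trained model would supply the embeddings from modality-specific encoders; the ranking logic would stay the same.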
Related papers
- From Latent To Engine Manifolds: Analyzing ImageBind's Multimodal Embedding Space (2024) 0.00
- Efficient Discriminative Joint Encoders For Large Scale Vision-language Reranking (2025) 0.00
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019) 0.00
- Learning Robust Visual-semantic Embeddings (2017) 15.22
- ABC: Achieving Better Control Of Multimodal Embeddings Using VLMs (2025) 0.00
- Textme: Bridging Unseen Modalities Through Text Descriptions (2026) 0.00
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020) 0.00
- Tri-modal Motion Retrieval By Learning A Joint Embedding Space (2024) 7.81