From Latent To Engine Manifolds: Analyzing ImageBind's Multimodal Embedding Space
2024 · Andrew Hamara, Pablo Rivas
Abstract
This study investigates ImageBind's ability to generate meaningful fused multimodal embeddings for online auto parts listings. We propose a simplistic embedding fusion workflow that aims to capture the overlapping information of image/text pairs, ultimately combining the semantics of a post into a joint embedding. After storing such fused embeddings in a vector database, we experiment with dimensionality reduction and provide empirical evidence to convey the semantic quality of the joint embeddings by clustering and examining the posts nearest to each cluster centroid. Additionally, our initial findings with ImageBind's emergent zero-shot cross-modal retrieval suggest that pure audio embeddings can correlate with semantically similar marketplace listings, indicating potential avenues for future research.
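The workflow described in the abstract (fuse each post's image and text embeddings, then inspect the posts closest to a cluster centroid) can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so averaging the two L2-normalized vectors is an assumption here, as are the helper names `fuse` and `nearest_to_centroid`. Random vectors stand in for real ImageBind outputs, and the vector database, dimensionality reduction, and clustering steps are omitted: the sketch treats all posts as a single cluster.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(image_emb, text_emb):
    """Hypothetical fusion rule (an assumption, not taken from the paper):
    average the two unit vectors, then renormalize the result."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(img, txt)])


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Unit-length mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)


def nearest_to_centroid(fused, k):
    """Indices of the k fused embeddings most similar to their centroid."""
    c = centroid(fused)
    ranked = sorted(range(len(fused)),
                    key=lambda i: cosine(fused[i], c),
                    reverse=True)
    return ranked[:k]


# Synthetic stand-ins for per-post (image, text) embeddings.
random.seed(0)
dim = 8
posts = [([random.gauss(0, 1) for _ in range(dim)],
          [random.gauss(0, 1) for _ in range(dim)])
         for _ in range(5)]

fused = [fuse(img, txt) for img, txt in posts]
top = nearest_to_centroid(fused, k=2)
print("posts closest to the centroid:", top)
```

In a fuller pipeline, a clustering step such as k-means would first assign posts to clusters, and `nearest_to_centroid` would then run once per cluster to pick representative listings for manual inspection.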
Related papers
- ImageBind: One Embedding Space To Bind Them All (2023)
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025)
- Cross-modal Image Retrieval With Deep Mutual Information Maximization (2021)
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020)
- Bundle Optimization For Multi-aspect Embedding (2017)
- Embedding Arithmetic Of Multimodal Queries For Image Retrieval (2021)
- Specializing Joint Representations For The Task Of Product Recommendation (2017)