ObjEmbed: Towards Universal Multimodal Object Embeddings
2026 · Shenghao Fu, Yukun Su, Fengyun Rao, et al.
Abstract
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarit
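As a rough illustration of the per-object scoring idea described in the abstract, the sketch below ranks candidate regions against a text query. The abstract is cut off before it says how semantic similarity and predicted localization quality are combined, so the product used here is purely an assumption for illustration, as are the names `cosine`, `match_score`, `rank_objects`, and the dictionary keys. It also assumes the IoU embedding has already been reduced to a scalar `pred_iou` in [0, 1].

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_score(semantic_sim, pred_iou):
    # ASSUMPTION: a simple product. The source text is truncated before it
    # specifies the actual combination rule.
    return semantic_sim * pred_iou


def rank_objects(text_embedding, objects):
    """Return (score, box) pairs sorted from best to worst match.

    Each object is a dict with hypothetical keys:
      "box":       region coordinates (any representation)
      "embedding": the object's semantic embedding
      "pred_iou":  predicted localization quality as a scalar
    """
    scored = [
        (match_score(cosine(text_embedding, obj["embedding"]), obj["pred_iou"]), obj["box"])
        for obj in objects
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    candidates = [
        {"box": (0, 0, 10, 10), "embedding": [1.0, 0.0], "pred_iou": 0.5},
        {"box": (5, 5, 20, 20), "embedding": [0.6, 0.8], "pred_iou": 0.9},
    ]
    for score, box in rank_objects([1.0, 0.0], candidates):
        print(box, round(score, 3))
```

Under this assumed rule, a region that matches the phrase well but is poorly localized can rank below a slightly weaker semantic match with a tighter box, which is the kind of trade-off a combined score is meant to capture.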
Related papers
- Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task (2019) 2.26
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019) 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) 4.52
- MULE: Multimodal Universal Language Embedding (2019) 9.03
- OLIVE: Object Level In-context Visual Embeddings (2024) 0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With MLLMs (2026) 0.00
- RzenEmbed: Towards Comprehensive Multimodal Retrieval (2025) 0.00
- UniMoCo: Unified Modality Completion For Robust Multi-modal Embeddings (2025) 1.40