UniMoCo: Unified Modality Completion for Robust Multi-modal Embeddings
2025 · Jiajun Qin, Yuan Pu, Zhuolun He, et al.
Abstract
Current research has explored vision-language models for multi-modal embedding tasks, such as information retrieval, visual grounding, and classification. However, real-world scenarios often involve diverse modality combinations between queries and targets, such as text and image to text, text and image to text and image, and text to text and image. These diverse combinations pose significant challenges for existing models, as they struggle to align all modality combinations within a unified embedding space during training, which degrades performance at inference. To address this limitation, we propose UniMoCo, a novel vision-language model architecture designed for multi-modal embedding tasks. UniMoCo introduces a modality-completion module that generates visual features from textual inputs, ensuring modality completeness for both queries and targets. Additionally, we develop a specialized training strategy to align embeddings from both original and modality-completed inputs, ensuring
Related papers
- Muco: Multi-turn Contrastive Learning For Multimodal Embedding Model (2026)
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With Mllms (2026)
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022)
- MULE: Multimodal Universal Language Embedding (2019)
- Vlmo: Unified Vision-language Pre-training With Mixture-of-modality-experts (2021)
- Objembed: Towards Universal Multimodal Object Embeddings (2026)