Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
2025 · Tiancheng Gu, Kaicheng Yang, Ziyong Feng, et al.
Abstract
The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage,
Related papers
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) · 2.26
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) · 0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With MLLMs (2026) · 0.00
- Compressing Then Matching: An Efficient Pre-training Paradigm For Multimodal Embedding (2025) · 0.00
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024) · 0.00
- Muco: Multi-turn Contrastive Learning For Multimodal Embedding Model (2026) · 2.71
- Guiding Cross-modal Representations With MLLM Priors Via Preference Alignment (2025) · 0.00
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With MLLMs (2025) · 3.11