UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
2025 · Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, et al.
Abstract
Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identificati
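The pipeline the abstract outlines (global retrieval of a candidate pool, soft scoring by an MLLM judge, then hard negative selection) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the paper: the function `mine_hard_negatives`, the `judge` callable (a stand-in for a real MLLM call), the `top_k` pool size, and the `false_neg_threshold` cutoff.

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as the embedding similarity."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(
    query_vec: Sequence[float],
    cand_vecs: Sequence[Sequence[float]],
    positive_idx: int,
    judge: Callable[[int], float],      # hypothetical: candidate index -> soft match score in [0, 1]
    top_k: int = 3,                     # hypothetical pool size
    false_neg_threshold: float = 0.8,   # hypothetical cutoff for likely false negatives
) -> List[Tuple[int, float]]:
    # Step 1: global retrieval. Rank every non-positive candidate by
    # similarity to the query and keep the top_k as potential hard negatives.
    ranked = sorted(
        ((dot(query_vec, vec), idx)
         for idx, vec in enumerate(cand_vecs) if idx != positive_idx),
        reverse=True,
    )
    pool = [idx for _, idx in ranked[:top_k]]

    # Step 2: ask the judge for a soft semantic matching score per candidate.
    scored = [(judge(idx), idx) for idx in pool]

    # Step 3: drop candidates the judge rates as near-matches (likely false
    # negatives) and return the rest, highest-scoring (hardest) first.
    kept = sorted(
        ((score, idx) for score, idx in scored if score < false_neg_threshold),
        reverse=True,
    )
    return [(idx, score) for score, idx in kept]
```

In this sketch the judge scores are used only to filter and order negatives; how UniME-V2 uses the soft scores during training is not recoverable from the truncated abstract, so it is left out.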
Related papers
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (2025)
- U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs (2025)
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs (2024)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)
- Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs (2026)
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024)
- Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying (2025)
- Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval (2025)