Llave: Large Language And Vision Embedding Models With Hardness-weighted Contrastive Learning
2025 Β· Zhibin Lan, Liqiang Niu, Fandong Meng, et al.
Abstract
Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the
Authors
(none)
Tags
Stats
Related papers
- Unime-v2: Mllm-as-a-judge For Universal Multimodal Embedding Learning (2025)0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)4.52
- Vlm2vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)0.00
- Improve Multi-modal Embedding Learning Via Explicit Hard Negative Gradient Amplifying (2025)2.80
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With Mllms (2026)0.00
- Mm-embed: Universal Multimodal Retrieval With Multimodal Llms (2024)0.00
- Nv-embed: Improved Techniques For Training Llms As Generalist Embedding Models (2024)0.00
- WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM (2025)0.00