Vlm2vec: Training Vision-language Models For Massive Multimodal Embedding Tasks
2024 Β· Ziyan Jiang, Rui Meng, Xinyi Yang, et al.
Abstract
Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite its importance and practicality. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets covering both in-distribution and out-of-distribution tasks, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-
Authors
(none)
Tags
Stats
Related papers
- Vlm2vec-v2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025)0.00
- Vlm2geovec: Toward Universal Multimodal Embeddings For Remote Sensing (2025)0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With Mllms (2026)0.00
- MULE: Multimodal Universal Language Embedding (2019)9.03
- Llave: Large Language And Vision Embedding Models With Hardness-weighted Contrastive Learning (2025)3.58
- Analyzing Diffusion And Autoregressive Vision Language Models In Multimodal Embedding Space (2026)0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)4.52
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)0.00