Compressing Then Matching: An Efficient Pre-training Paradigm For Multimodal Embedding
2025 · Da Li, Yuxiao Luo, Keping Bi, et al.
Abstract
Multimodal Large Language Models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Expe
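The contrastive-learning stage that the abstract refers to is commonly implemented with an InfoNCE-style loss over in-batch negatives. The sketch below is a generic illustration of that idea only, not CoMa's actual objective or code: the function names (`l2_normalize`, `info_nce`), the temperature value, and the toy 2-D vectors are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss (hypothetical helper, not from the paper).

    queries[i] is paired with targets[i]; every other target in the
    batch serves as a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity between query i and each target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_denom - logits[i]
    return total / len(q)

# Toy batch: each query points along one axis (assumed data).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[0.9, 0.1], [0.1, 0.9]])  # correct pairing
swapped = info_nce(queries, [[0.1, 0.9], [0.9, 0.1]])  # mismatched pairing
```

In this toy setup the correctly paired batch yields a lower loss than the mismatched one, which is the behaviour contrastive training relies on to pull matching pairs together and push non-matching pairs apart.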
Related papers
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026) · 0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With MLLMs (2026) · 0.00
- Muco: Multi-turn Contrastive Learning For Multimodal Embedding Model (2026) · 2.71
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) · 0.00
- Guiding Cross-modal Representations With MLLM Priors Via Preference Alignment (2025) · 0.00
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024) · 0.00
- Multimodal Contrastive Training For Visual Representation Learning (2021) · 16.32