MuCo: Multi-Turn Contrastive Learning For Multimodal Embedding Model
2026 · Geonmo Gu, Byeongho Heo, Jaemyung Yu, et al.
Abstract
Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet despite their empirical success, these models are primarily built on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm is computationally inefficient at scale: it requires a separate forward pass for each pair, and it overlooks contextual relationships among multiple queries that relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a sha
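To make the contrast with the single-turn setup concrete, here is a minimal sketch of how embeddings collected from several turns per image could feed a standard InfoNCE contrastive loss. It is not the paper's implementation: the abstract does not specify the loss, so the use of InfoNCE, the temperature value, the helper names (`info_nce`, `flatten_turns`), and the choice to treat other turns of the same image as negatives are all assumptions made for illustration. The toy vectors stand in for embeddings that an MLLM would produce in one forward pass per image.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.07):
    """Standard InfoNCE (assumed here, not confirmed by the abstract):
    query i should score highest against target i among all targets."""
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    losses = []
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


def flatten_turns(batch):
    """batch: one entry per image; each entry is a list of
    (query_embedding, target_embedding) turns assumed to come from a single
    forward pass over that image. Returns flat, index-aligned lists."""
    queries, targets = [], []
    for turns in batch:
        for q_emb, t_emb in turns:
            queries.append(q_emb)
            targets.append(t_emb)
    return queries, targets


# Hypothetical toy batch: 2 images, 2 turns each, 3-dimensional embeddings.
toy_batch = [
    [([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]), ([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])],
    [([0.0, 0.0, 1.0], [0.0, 0.0, 1.0]), ([1.0, 1.0, 0.0], [1.0, 1.0, 0.0])],
]
toy_queries, toy_targets = flatten_turns(toy_batch)
toy_loss = info_nce(toy_queries, toy_targets)
```

In this sketch two images yield four query-target pairs from two forward passes, whereas a single-turn setup would need four. Whether turns that share an image should count as negatives for one another is a design question this sketch does not settle.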
Related papers
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52
- UniMoCo: Unified Modality Completion For Robust Multi-modal Embeddings (2025) · 1.40
- Compressing Then Matching: An Efficient Pre-training Paradigm For Multimodal Embedding (2025) · 0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With MLLMs (2026) · 0.00
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026) · 0.00
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) · 0.00
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With MLLMs (2025) · 3.11
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025) · 0.00