Guiding Cross-modal Representations With MLLM Priors Via Preference Alignment
2025 Β· Pengfei Zhao, Rongbo Luan, Wei Zhang, et al.
Abstract
Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, We introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine grained alignment priors inherent in MLLM to guide cross modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) Automatic preference data construction using off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embed
Authors
(none)
Tags
Stats
Related papers
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal Llms (2025)4.52
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)2.26
- Compressing Then Matching: An Efficient Pre-training Paradigm For Multimodal Embedding (2025)0.00
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)0.00
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025)3.64
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)0.00
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026)0.00
- Context-adaptive Multi-prompt Embedding With Large Language Models For Vision-language Alignment (2025)0.00