PLUME: Latent Reasoning Based Universal Multimodal Embedding
2026 · Chenwei He, Xiangzhao Hao, Tianyu Yang, et al.
Abstract
Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this be
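The abstract's core idea, replacing a verbalized chain of thought with a short rollout of continuous latent states under a fixed step budget, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' architecture: the names `latent_rollout`, `anchors`, and `adapters` are hypothetical, the transition is a plain `tanh` residual update rather than a language-model forward pass, and the anchor gating is only one plausible reading of a "semantic-anchor-guided transition adapter".

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_normalize(x):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_rollout(h0, anchors, adapters, num_steps):
    """Roll a latent state forward for a fixed number of steps.

    Hypothetical toy stand-in: at each step the current state is compared
    with every anchor, the similarities gate a mixture of adapter matrices,
    and the updated state is fed back as the next input. No tokens are
    decoded at any point; the final state is used as the embedding.
    """
    h = l2_normalize(h0)
    for _ in range(num_steps):
        # Gate each adapter by the similarity between the state and its anchor.
        gates = softmax([sum(a * v for a, v in zip(anchor, h)) for anchor in anchors])
        update = [0.0] * len(h)
        for g, W in zip(gates, adapters):
            for i, v in enumerate(matvec(W, h)):
                update[i] += g * v
        # Residual update, squashed and renormalized to keep the state bounded.
        h = l2_normalize([math.tanh(hv + uv) for hv, uv in zip(h, update)])
    return h


# Toy usage with random parameters (all sizes are arbitrary).
rng = random.Random(0)
dim, num_anchors = 8, 3
anchors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_anchors)]
adapters = [
    [[rng.gauss(0, 0.3) for _ in range(dim)] for _ in range(dim)]
    for _ in range(num_anchors)
]
query_state = [rng.gauss(0, 1) for _ in range(dim)]
embedding = latent_rollout(query_state, anchors, adapters, num_steps=4)
```

In this sketch the cost is fixed by `num_steps` regardless of the query, while the gates let different queries follow different transitions; that is the property the abstract attributes to steering the rollout "under the same fixed computation budget".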
Related papers
- Embed-RL: Reinforcement Learning For Reasoning-driven Multimodal Embeddings (2026) · 0.00
- Reasoning-augmented Representations For Multimodal Retrieval (2026) · 0.00
- TRACE: Task-adaptive Reasoning And Representation Learning For Universal Multimodal Retrieval (2026) · 0.00
- Reasoning Guided Embeddings: Leveraging MLLM Reasoning For Improved Multimodal Retrieval (2025) · 0.00
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With MLLMs (2025) · 3.11
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026) · 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52
- UniME-V2: MLLM-as-a-Judge For Universal Multimodal Embedding Learning (2025) · 0.00