Embed-RL: Reinforcement Learning For Reasoning-driven Multimodal Embeddings
2026 · Haonan Jiang, Yuji Wang, Yongjie Zhu, et al.
Abstract
Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the CoTs generated by existing generative embedding methods are limited to textual analysis of the queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cue
Related papers
- Reasoning Guided Embeddings: Leveraging MLLM Reasoning For Improved Multimodal Retrieval (2025)
- TRACE: Task-adaptive Reasoning And Representation Learning For Universal Multimodal Retrieval (2026)
- PLUME: Latent Reasoning Based Universal Multimodal Embedding (2026)
- V-Retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026)
- Reasoning-augmented Representations For Multimodal Retrieval (2026)
- U-MARVEL: Unveiling Key Factors For Universal Multimodal Retrieval Via Embedding Learning With MLLMs (2025)
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026)
- RzenEmbed: Towards Comprehensive Multimodal Retrieval (2025)