Taming Masked Diffusion Language Models Via Consistency Trajectory Reinforcement Learning With Fewer Decoding Step
2025 · Jingyi Yang, Guanxu Chen, Xuhao Hu, et al.
Abstract
Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To addre
Related papers
- MDPO: Overcoming The Training-inference Divide Of Masked Diffusion Language Models (2025)
- SPG: Sandwiched Policy Gradient For Masked Diffusion Language Models (2025)
- Inpainting-guided Policy Optimization For Diffusion Large Language Models (2025)
- Simple Policy Gradients For Reasoning With Diffusion Language Models (2025)
- Avoiding Mode Collapse In Diffusion Models Fine-tuned With Reinforcement Learning (2024)
- MADiff: Offline Multi-agent Learning With Diffusion Models (2023)
- DiffPO: Training Diffusion LLMs To Reason Fast And Furious Via Reinforcement Learning (2025)
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)