Stabilizing Reinforcement Learning For Diffusion Language Models
2026 · Jianyuan Zhong, Kaibo Wang, Ding Ding, et al.
Abstract
Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLL
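The clipping issue named in the abstract can be illustrated with a small numerical sketch. It assumes the standard PPO-style clipped surrogate min(r·A, clip(r, 1−ε, 1+ε)·A) that GRPO commonly uses, and it injects hypothetical Gaussian noise into the log-ratio to stand in for an ELBO-style estimate. The function name, the noise scale, and the advantages are illustrative assumptions; this is not the authors' code and it does not implement StableDRL.

```python
import numpy as np

def clipped_surrogate_grad(log_ratio, advantage, eps=0.2):
    """Per-sample gradient of min(r*A, clip(r)*A) with respect to log r."""
    r = np.exp(log_ratio)
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps)
    # The min picks the unclipped branch whenever r*A <= clip(r)*A.
    # For A < 0 this holds for every r >= 1 - eps, so a large r is never clipped.
    unclipped_active = r * advantage <= clipped * advantage
    return np.where(unclipped_active, r * advantage, 0.0)

rng = np.random.default_rng(0)
true_log_ratio = np.zeros(8)            # policy has not moved: true ratio is 1
noise = rng.normal(0.0, 1.0, size=8)    # hypothetical estimator noise (assumption)
adv = np.array([1.0, -1.0] * 4)         # illustrative group-relative advantages

grad_clean = clipped_surrogate_grad(true_log_ratio, adv)
grad_noisy = clipped_surrogate_grad(true_log_ratio + noise, adv)

print("max |grad| with exact ratios:", np.abs(grad_clean).max())
print("max |grad| with noisy ratios:", np.abs(grad_noisy).max())
```

With exact ratios every sample contributes a gradient of magnitude |A|. With noisy ratios, a negative-advantage sample whose estimated ratio lands far above 1+ε still passes through the unclipped branch, so its gradient scales with the noisy ratio itself. That is one plausible reading of how estimation noise can bypass conditional clipping.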
Related papers
- Simple Policy Gradients For Reasoning With Diffusion Language Models (2025)
- Inpainting-guided Policy Optimization For Diffusion Large Language Models (2025)
- SPG: Sandwiched Policy Gradient For Masked Diffusion Language Models (2025)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Dichotomous Diffusion Policy Optimization (2025)
- Wd1: Weighted Policy Optimization For Reasoning In Diffusion Language Models (2025)
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)
- AEGPO: Adaptive Entropy-guided Policy Optimization For Diffusion Models (2026)