SPG: Sandwiched Policy Gradient For Masked Diffusion Language Models
2025 · Chenyu Wang, Paria Rashidinejad, Dijia Su, et al.
Abstract
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
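To make the bias argument concrete, here is a minimal sketch of how a two-sided bound could be used in a policy-gradient surrogate. The abstract states only that SPG uses both an upper and a lower bound; the pairing rule below (lower bound for non-negative advantages, upper bound for negative ones) is an assumption chosen for illustration, and the function names and numbers are hypothetical. With that pairing, every term is no larger than the corresponding term of the true objective, so the surrogate is a valid lower bound regardless of the sign of the advantage. A lower-bound-only surrogate has no such guarantee once advantages are negative.

```python
from typing import Sequence


def true_objective(advantages: Sequence[float], log_probs: Sequence[float]) -> float:
    """Mean of A_i * log p_i; intractable for dLLMs, shown here only for comparison."""
    return sum(a * lp for a, lp in zip(advantages, log_probs)) / len(advantages)


def lower_only_surrogate(advantages: Sequence[float], lower: Sequence[float]) -> float:
    """Plug a lower bound (e.g. an ELBO) in for every sample, whatever the sign of A_i."""
    return sum(a * lo for a, lo in zip(advantages, lower)) / len(advantages)


def sandwiched_surrogate(
    advantages: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Assumed pairing rule (illustrative, not taken from the abstract):
    use the lower bound when A_i >= 0 and the upper bound when A_i < 0.
    Since lower_i <= log p_i <= upper_i, each term is <= A_i * log p_i.
    """
    total = 0.0
    for a, lo, hi in zip(advantages, lower, upper):
        if lo > hi:
            raise ValueError("lower bound must not exceed upper bound")
        total += a * (lo if a >= 0 else hi)
    return total / len(advantages)


# Hypothetical numbers: one positively and one negatively rewarded sample.
advantages = [1.0, -1.0]
log_probs = [-2.0, -3.0]   # true log-likelihoods (unknown in practice)
lower = [-2.5, -4.0]       # lower bounds on log_probs
upper = [-1.5, -2.5]       # upper bounds on log_probs

exact = true_objective(advantages, log_probs)
one_sided = lower_only_surrogate(advantages, lower)
two_sided = sandwiched_surrogate(advantages, lower, upper)
```

In this toy case the one-sided surrogate overshoots the true objective, because multiplying a lower bound by a negative advantage flips the inequality, while the two-sided surrogate stays below it.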
Related papers
- Simple Policy Gradients For Reasoning With Diffusion Language Models (2025)
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)
- Inpainting-guided Policy Optimization For Diffusion Large Language Models (2025)
- MDPO: Overcoming The Training-inference Divide Of Masked Diffusion Language Models (2025)
- Diffpo: Training Diffusion LLMs To Reason Fast And Furious Via Reinforcement Learning (2025)
- Wd1: Weighted Policy Optimization For Reasoning In Diffusion Language Models (2025)
- Efficient Differentially Private Fine-tuning Of LLMs Via Reinforcement Learning (2025)
- Taming Masked Diffusion Language Models Via Consistency Trajectory Reinforcement Learning With Fewer Decoding Step (2025)