Simple Policy Gradients For Reasoning With Diffusion Language Models
2025 · Anthony Zhan
Abstract
Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Fur
Related papers
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)
- SPG: Sandwiched Policy Gradient For Masked Diffusion Language Models (2025)
- Wd1: Weighted Policy Optimization For Reasoning In Diffusion Language Models (2025)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Inpainting-guided Policy Optimization For Diffusion Large Language Models (2025)
- Diffpo: Training Diffusion LLMs To Reason Fast And Furious Via Reinforcement Learning (2025)
- Dichotomous Diffusion Policy Optimization (2025)
- Diffusion Policy Through Conditional Proximal Policy Optimization (2026)