Simple Policy Gradients For Reasoning With Diffusion Language Models

Abstract

Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Fur
