Demystifying Group Relative Policy Optimization: Its Policy Gradient Is A U-statistic
2026 · Hongyi Zhou, Kai Ye, Erhan Xu, et al.
Abstract
Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling the reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain understudied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE) and to derive the finite-sample error bound and the asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm -- one with access to a value function that quantifies the goodness of its learning policy at each training iteration -- and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furt
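The U-statistic view can be illustrated with a small numerical check. The sketch below is not the paper's construction: it assumes a simplified group-relative gradient that subtracts the group mean reward only (no standard-deviation normalization, clipping, or KL term), and it uses random vectors as stand-ins for the per-response score functions. The function names and the pairwise kernel h = 0.5 * (r_i - r_j) * (g_i - g_j) are illustrative choices. Under these assumptions, the mean-baseline gradient equals (G - 1) / G times an order-2 U-statistic over the G sampled responses.

```python
import itertools

import numpy as np


def mean_baseline_gradient(rewards, score_grads):
    """Simplified group-relative gradient: (1/G) * sum_i (r_i - mean(r)) * g_i."""
    rewards = np.asarray(rewards, dtype=float)
    score_grads = np.asarray(score_grads, dtype=float)
    centered = rewards - rewards.mean()
    return (centered[:, None] * score_grads).mean(axis=0)


def pairwise_u_statistic(rewards, score_grads):
    """Order-2 U-statistic: average of h(i, j) over all unordered pairs i < j,
    with the symmetric kernel h(i, j) = 0.5 * (r_i - r_j) * (g_i - g_j)."""
    rewards = np.asarray(rewards, dtype=float)
    score_grads = np.asarray(score_grads, dtype=float)
    group_size = len(rewards)
    total = np.zeros(score_grads.shape[1])
    for i, j in itertools.combinations(range(group_size), 2):
        total += 0.5 * (rewards[i] - rewards[j]) * (score_grads[i] - score_grads[j])
    num_pairs = group_size * (group_size - 1) / 2
    return total / num_pairs


# Hypothetical toy data: G responses, d-dimensional score-function stand-ins.
rng = np.random.default_rng(0)
G, d = 8, 5
rewards = rng.integers(0, 2, size=G).astype(float)  # binary rewards (assumption)
score_grads = rng.normal(size=(G, d))

lhs = mean_baseline_gradient(rewards, score_grads)
rhs = (G - 1) / G * pairwise_u_statistic(rewards, score_grads)
identity_holds = bool(np.allclose(lhs, rhs))
```

The identity follows from expanding the group mean: each ordered pair (i, j) contributes (r_i - r_j) * g_i, and pairing (i, j) with (j, i) gives the symmetric kernel. The (G - 1) / G factor vanishes as the group size grows.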
Related papers
- Hybrid Group Relative Policy Optimization: A Multi-sample Approach To Enhancing Policy Optimization (2025)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends (2025)
- Beyond KL Divergence: Policy Optimization With Flexible Bregman Divergences For LLM Reasoning (2026)
- MMR-GRPO: Accelerating GRPO-style Training Through Diversity-aware Reward Reweighting (2026)