Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO
2025 · Peter Chen, Xiaopeng Li, Ziniu Li, et al.
Abstract
Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on
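The all-negative-sample issue described in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO-style advantage, in which each reward is standardized by its group's mean and standard deviation. The function names, the `eps` constant, and the step-wise shaping (fraction of steps before the first judged error) are hypothetical choices made for illustration, not the paper's exact formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style advantage; eps is an assumed stabilizer)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_reward(first_error_step, num_steps):
    """Hypothetical shaping: fraction of reasoning steps judged correct before the first error."""
    return first_error_step / num_steps


# Every response in the group is wrong, so binary outcome rewards are identical.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_rewards)

# Hypothetical judge output per response: (index of first wrong step, total steps).
judged = [(1, 4), (3, 4), (0, 5), (2, 4)]
shaped_rewards = [stepwise_reward(k, n) for k, n in judged]
shaped_adv = group_relative_advantages(shaped_rewards)

print(binary_adv)  # identical rewards: every advantage is zero, so no policy update
print(shaped_adv)  # differing rewards: nonzero advantages that rank the wrong answers
```

With identical rewards every advantage is zero, so the group contributes no gradient. Once a judge scores the wrong responses differently, the response that stays correct longest receives the largest advantage within the group.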
Related papers
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- Simple Policy Gradients For Reasoning With Diffusion Language Models (2025)
- MMR-GRPO: Accelerating GRPO-style Training Through Diversity-aware Reward Reweighting (2026)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends (2025)