NGRPO: Negative-enhanced Group Relative Policy Optimization
2025 Β· Gongrui Nan, Siye Chen, Jing Huang, et al.
Abstract
RLVR has enhanced the reasoning capabilities of Large Language Models (LLMs) across various tasks. However, GRPO, a representative RLVR algorithm, suffers from a critical limitation: when all responses within a group are either entirely correct or entirely incorrect, the model fails to learn from these homogeneous responses. This is particularly problematic for homogeneously incorrect groups, where GRPO's advantage function yields a value of zero, leading to null gradients and the loss of valuable learning signals. To overcome this issue, we propose NGRPO (Negative-enhanced Group Relative Policy Optimization), an algorithm designed to convert homogeneous errors into robust learning signals. First, NGRPO introduces Advantage Calibration. This mechanism hypothesizes the existence of a virtual maximum-reward sample during advantage calculation, thereby altering the mean and variance of rewards within a group and ensuring that the advantages for homogeneously incorrect samples are no longe
Authors
(none)
Tags
Stats
Related papers
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)0.00
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)0.00
- Noise-corrected GRPO: From Noisy Rewards To Unbiased Gradients (2025)0.00
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)0.00
- Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends (2025)0.00
- Reinforcement Learning With Verifiable Rewards: Grpo's Effective Loss, Dynamics, And Success Amplification (2025)0.00
- Hybrid Group Relative Policy Optimization: A Multi-sample Approach To Enhancing Policy Optimization (2025)0.00
- Balanced Aggregation: Understanding And Fixing Aggregation Bias In GRPO (2026)0.00