Noise-corrected GRPO: From Noisy Rewards To Unbiased Gradients
2025 · Omar El Mansouri, Mohamed El Amine Seddik, Salem Lahlou
Abstract
Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up t
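The debiasing idea described in the abstract can be illustrated with a minimal sketch. It assumes binary rewards and known (or already estimated) flip probabilities, and it uses the standard affine correction for Bernoulli label noise, which may differ from the paper's exact estimator. The names `debias_reward`, `group_advantages`, `rho_fp`, and `rho_fn` are hypothetical and chosen for illustration only.

```python
def debias_reward(observed, rho_fp, rho_fn):
    """Unbiased estimate of a binary reward from a noisy observation.

    Assumed noise model (hypothetical parameter names):
      rho_fp = P(observed = 1 | true = 0)   # false positive rate
      rho_fn = P(observed = 0 | true = 1)   # false negative rate
    Since E[observed | true] = rho_fp + true * (1 - rho_fp - rho_fn),
    inverting this affine map gives an estimator whose mean is `true`.
    """
    denom = 1.0 - rho_fp - rho_fn
    if denom <= 0:
        raise ValueError("correction requires rho_fp + rho_fn < 1")
    return (observed - rho_fp) / denom


def group_advantages(rewards):
    """Mean-centred advantages within one group of sampled responses.

    This is the centring-only variant (no division by the group
    standard deviation), used here purely as an illustration.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def expected_corrected(true_reward, rho_fp, rho_fn):
    """Exact expectation of the corrected reward under the noise model."""
    p_one = (1.0 - rho_fn) if true_reward == 1 else rho_fp
    return (p_one * debias_reward(1, rho_fp, rho_fn)
            + (1.0 - p_one) * debias_reward(0, rho_fp, rho_fn))


if __name__ == "__main__":
    rho_fp, rho_fn = 0.1, 0.2
    noisy_group = [1, 0, 0, 1]  # observed (possibly flipped) rewards
    corrected = [debias_reward(r, rho_fp, rho_fn) for r in noisy_group]
    print(group_advantages(corrected))
    print(expected_corrected(1, rho_fp, rho_fn))
    print(expected_corrected(0, rho_fp, rho_fn))
```

Under this sketch, the constant offset `rho_fp` cancels when rewards are mean-centred within a group, so the corrected advantages equal the noisy advantages rescaled by `1 / (1 - rho_fp - rho_fn)`.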
Related papers
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends (2025)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Balanced Aggregation: Understanding And Fixing Aggregation Bias In GRPO (2026)
- MMR-GRPO: Accelerating GRPO-style Training Through Diversity-aware Reward Reweighting (2026)