When Errors Can Be Beneficial: A Categorization Of Imperfect Rewards For Policy Gradient
2026 · Shuning Shang, Hubert Strauss, Stanley Wei, et al.
Abstract
arXiv:2604.25872v1
Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop rewa
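The abstract names ranking accuracy as a standard metric for proxy reward quality. As a rough illustration only, the sketch below computes one common pairwise form of that metric: the fraction of output pairs that the proxy orders the same way as the ground truth. The function name `ranking_accuracy`, the handling of ties, and the toy reward values are assumptions made for this example and are not taken from the paper.

```python
from itertools import combinations


def ranking_accuracy(proxy_rewards, true_rewards):
    """Fraction of output pairs that the proxy orders like the ground truth.

    Assumed convention (not from the paper): pairs tied under the ground
    truth are skipped, and a proxy tie on a non-tied pair counts as an error.
    """
    if len(proxy_rewards) != len(true_rewards):
        raise ValueError("reward lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(true_rewards)), 2):
        true_diff = true_rewards[i] - true_rewards[j]
        if true_diff == 0:
            continue  # ground truth expresses no preference for this pair
        proxy_diff = proxy_rewards[i] - proxy_rewards[j]
        total += 1
        if true_diff * proxy_diff > 0:
            agree += 1  # proxy and ground truth prefer the same output

    if total == 0:
        raise ValueError("no pairs with distinct ground truth rewards")
    return agree / total


# Hypothetical rewards for three outputs; the proxy overrates the second one.
true_rewards = [1.0, 0.5, 0.0]
proxy_rewards = [0.2, 0.9, 0.1]
accuracy = ranking_accuracy(proxy_rewards, true_rewards)
```

Here the proxy misorders one of the three pairs, so `accuracy` is 2/3. A metric of this form counts every misordered pair equally, which is the behavior the abstract contrasts with its claim that some reward errors can be benign or beneficial.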
Related papers
- The Perils Of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret (2024)
- Policy Improvement Reinforcement Learning (2026)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Noise-corrected GRPO: From Noisy Rewards To Unbiased Gradients (2025)
- Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards (2024)
- Can RLHF Be More Efficient With Imperfect Reward Models? A Policy Coverage Perspective (2025)
- Causal Confusion And Reward Misidentification In Preference-based Reward Learning (2022)
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)