Correlated Proxies: A New Definition And Improved Mitigation For Reward Hacking
2024 · Cassidy Laidlaw, Shivam Singhal, Anca Dragan
Abstract
Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy", a correlation that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking.
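The idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a hypothetical one-step problem with four actions, made-up reward values, and a KL penalty as the regularizer (the abstract text above does not say which divergence is used). All names (`ref_policy`, `proxy_reward`, `true_reward`, `lam`) are illustrative. It checks three things: the proxy and true rewards are correlated under the reference policy, greedy proxy optimization lands on an action with poor true reward, and a policy regularized toward the reference does better on the true reward than the reference itself.

```python
import math

# Hypothetical toy problem: four actions, values chosen only for illustration.
ref_policy = [0.5, 0.4, 0.0999, 0.0001]   # reference policy action probabilities
proxy_reward = [0.0, 1.0, 2.0, 5.0]       # what the optimizer sees
true_reward = [0.0, 1.0, 2.0, -5.0]       # last action exploits the proxy


def expected(values, policy):
    """Expected value of `values` when actions are drawn from `policy`."""
    return sum(p * v for p, v in zip(policy, values))


def weighted_correlation(x, y, policy):
    """Pearson correlation of x and y over actions drawn from `policy`."""
    mx, my = expected(x, policy), expected(y, policy)
    cov = sum(p * (a - mx) * (b - my) for p, a, b in zip(policy, x, y))
    var_x = sum(p * (a - mx) ** 2 for p, a in zip(policy, x))
    var_y = sum(p * (b - my) ** 2 for p, b in zip(policy, y))
    return cov / math.sqrt(var_x * var_y)


def kl_regularized_policy(proxy, ref, lam):
    """Maximizer of E_pi[proxy] - lam * KL(pi || ref).

    The closed form is pi(a) proportional to ref(a) * exp(proxy(a) / lam).
    Using KL here is an assumption made for this sketch.
    """
    m = max(proxy)  # subtract the max for numerical stability
    weights = [r * math.exp((q - m) / lam) for r, q in zip(ref, proxy)]
    z = sum(weights)
    return [w / z for w in weights]


# Correlation between proxy and true reward on actions the reference policy takes.
corr_ref = weighted_correlation(proxy_reward, true_reward, ref_policy)

# Unregularized optimization: put all probability on the best proxy action.
best = max(range(len(proxy_reward)), key=lambda a: proxy_reward[a])
greedy_policy = [1.0 if a == best else 0.0 for a in range(len(proxy_reward))]

# Regularized optimization with a hypothetical penalty weight.
reg_policy = kl_regularized_policy(proxy_reward, ref_policy, lam=1.0)

true_ref = expected(true_reward, ref_policy)
true_greedy = expected(true_reward, greedy_policy)
true_reg = expected(true_reward, reg_policy)

print(f"correlation under reference policy: {corr_ref:.3f}")
print(f"true reward, reference policy:      {true_ref:.3f}")
print(f"true reward, greedy on proxy:       {true_greedy:.3f}")
print(f"true reward, KL-regularized:        {true_reg:.3f}")
```

In this toy setting the exploitable action is almost never taken by the reference policy, so it barely affects the correlation, yet it dominates once the proxy is optimized without constraint. The penalty weight `lam` controls how far the optimized policy may drift from the reference. The value used here is arbitrary.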
Related papers
- The Effects Of Reward Misspecification: Mapping And Mitigating Misaligned Models (2022)
- Causal Confusion And Reward Misidentification In Preference-based Reward Learning (2022)
- REBEL: Reward Regularization-based Approach For Robotic Reinforcement Learning From Human Feedback (2023)
- Going Beyond Heuristics By Imposing Policy Improvement As A Constraint (2025)
- Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards (2024)
- Goodhart's Law In Reinforcement Learning (2023)
- When Errors Can Be Beneficial: A Categorization Of Imperfect Rewards For Policy Gradient (2026)
- Provably Mitigating Overoptimization In RLHF: Your SFT Loss Is Implicitly An Adversarial Regularizer (2024)