Goodhart's Law In Reinforcement Learning
2023 · Jacek Karwowski, Oliver Hayman, Xingjian Bai, et al.
Abstract
Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that
Related papers
- The Perils Of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret (2024)
- Pitfalls Of Learning A Reward Function Online (2020)
- Reward Design For Reinforcement Learning Agents (2025)
- Informativeness Of Reward Functions In Reinforcement Learning (2024)
- On The Expressivity Of Markov Reward (2021)
- Reward Tweaking: Maximizing The Total Reward While Planning For Short Horizons (2020)
- Invariance In Policy Optimisation And Partial Identifiability In Reward Learning (2022)
- When Errors Can Be Beneficial: A Categorization Of Imperfect Rewards For Policy Gradient (2026)