Understanding The Pathologies Of Approximate Policy Evaluation When Combined With Greedification In Reinforcement Learning
2020 · Kenny Young, Richard S. Sutton
Abstract
Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood, however, is the quality of the policies in the region of oscillation. In this paper we present simple examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribu
Related papers
- The Role Of Lookahead And Approximate Policy Evaluation In Reinforcement Learning With Linear Value Function Approximation (2021)
- Greedification Operators For Policy Optimization: Investigating Forward And Reverse KL Divergences (2021)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)
- Guarantees For Epsilon-greedy Reinforcement Learning With Function Approximation (2022)
- Bad Habits: Policy Confounding And Out-of-trajectory Generalization In RL (2023)
- Multi-step Greedy Reinforcement Learning Algorithms (2019)
- Improving Deep Reinforcement Learning By Reducing The Chain Effect Of Value And Policy Churn (2024)
- Actor-critic Policy Optimization In Partially Observable Multiagent Environments (2018)