Natural Policy Gradient For Average Reward Non-stationary RL

Abstract

We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it as a Markov Decision Process with time-varying rewards and transition probabilities, subject to a variation budget of \(\Delta_T\). Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget \(\Delta_T\). We establish a dynamic regret bound of \(\tilde{\mathscr{O}}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})\) for both algorithms.
