Natural Policy Gradient For Average Reward Non-stationary RL
2025 · Neharika Jali, Eshika Pathak, Pranay Sharma, et al.
Abstract
We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it as a Markov Decision Process with time-varying rewards and transition probabilities, subject to a variation budget of \(\Delta_T\). Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL-based, parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget \(\Delta_T\). We establish a dynamic regret bound of \(\tilde{\mathscr{O}}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})\) for both algorithms,
Related papers
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward MDPs (2025)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- A Sharper Global Convergence Analysis For Average Reward Reinforcement Learning Via An Actor-critic Approach (2024)
- A Nearly Blackwell-optimal Policy Gradient Method (2021)
- Federated Natural Policy Gradient And Actor Critic Methods For Multi-task Reinforcement Learning (2023)
- Natural Policy Gradient And Actor Critic Methods For Constrained Multi-task Reinforcement Learning (2024)
- Neural Network Compatible Off-policy Natural Actor-critic Algorithm (2021)
- Recurrent Natural Policy Gradient For POMDPs (2024)