Stochastic First-order Methods For Average-reward Markov Decision Processes
2022 · Tianjiao Li, Feiyang Wu, Guanghui Lan
Abstract
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation. Compared with intensive research efforts in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions, and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward stochastic policy mirror descent (SPMD) method for solving AMDPs with and without regularizers and provide convergence guarantees in terms of the long-term average reward. For policy evaluation, existing on-policy methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies due to the lack of exploration in the action space. To remedy these issues, we develop a variance-reduced temporal difference (VRTD) method with li
Related papers
- Near Sample-optimal Reduction-based Policy Learning For Average Reward MDP (2022)
- Learning And Planning In Average-reward Markov Decision Processes (2020)
- Learning General Parameterized Policies For Infinite Horizon Average Reward Constrained MDPs Via Primal-dual Policy Gradient Algorithm (2024)
- Policy Gradient For Robust Markov Decision Processes (2024)
- Optimal Sample Complexity For Average Reward Markov Decision Processes (2023)
- Reinforcement Learning For Infinite-horizon Average-reward Linear MDPs Via Approximation By Discounted-reward MDPs (2024)
- Optimal Convergence Rate For Exact Policy Mirror Descent In Discounted Markov Decision Processes (2023)
- Sharper Model-free Reinforcement Learning For Average-reward Markov Decision Processes (2023)