Performance Bounds For Policy-based Average Reward Reinforcement Learning Algorithms
2023 · Yashaswini Murthy, Mehrdad Moharrami, R. Srikant
Abstract
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to \(1\), which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the asy
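The horizon-scaling argument in the abstract can be illustrated with a short worked calculation. This is a sketch that is not taken from the paper itself: it assumes the classical asymptotic bound for approximate PI in discounted problems (in the form given by Bertsekas and Tsitsiklis), and the symbols \(\gamma\) (discount factor), \(\epsilon\) (policy-evaluation error), \(\delta\) (policy-improvement error), and \(H\) (expected horizon) are introduced here for illustration only.

\[
\limsup_{k \to \infty} \left\| J^{\mu_k} - J^{*} \right\|_{\infty} \;\le\; \frac{\delta + 2\gamma\epsilon}{(1-\gamma)^{2}} \;=\; H^{2}\,(\delta + 2\gamma\epsilon), \qquad H = \frac{1}{1-\gamma}.
\]

Dividing by the horizon, i.e., multiplying by \(1-\gamma\), to pass to a per-step (average-reward) quantity gives

\[
(1-\gamma)\,\frac{\delta + 2\gamma\epsilon}{(1-\gamma)^{2}} \;=\; \frac{\delta + 2\gamma\epsilon}{1-\gamma} \;=\; H\,(\delta + 2\gamma\epsilon) \;\longrightarrow\; \infty \quad \text{as } \gamma \to 1,
\]

whenever the approximation errors are nonzero. Under these assumptions, the normalized bound still grows linearly in the horizon, which is the divergence the abstract describes.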
Related papers
- Average-reward Reinforcement Learning With Trust Region Methods (2021)
- Examining Average And Discounted Reward Optimality Criteria In Reinforcement Learning (2021)
- Learning Fair Policies In Multiobjective (Deep) Reinforcement Learning With Average And Discounted Rewards (2020)
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward MDPs (2025)
- Inverse Reinforcement Learning With The Average Reward Criterion (2023)
- Sharper Model-free Reinforcement Learning For Average-reward Markov Decision Processes (2023)
- Reinforcement Learning For Infinite-horizon Average-reward Linear MDPs Via Approximation By Discounted-reward MDPs (2024)
- ACPO: A Policy Optimization Algorithm For Average MDPs With Constraints (2023)