Optimal Sample Complexity For Average Reward Markov Decision Processes
2023 Β· Shengbo Wang, Jose Blanchet, Peter Glynn
Abstract
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of \(\widetilde O(|S||A|t_\{\text\{mix\}\}^2 \epsilon^\{-2\})\) and a lower bound of \(Ξ©(|S||A|t_\{\text\{mix\}\} \epsilon^\{-2\})\). In these expressions, \(|S|\) and \(|A|\) denote the cardinalities of the state and action spaces respectively, \(t_\{\text\{mix\}\}\) serves as a uniform upper limit for the total variation mixing times, and \(\epsilon\) signifies the error tolerance. Therefore, a notable gap of \(t_\{\text\{mix\}\}\) still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of \(\widetilde O(|S||A|t_\{\text\{mix\}\}\epsilon^\{-2\})\). This marks the first algorithm and anal
Authors
(none)
Tags
Stats
Related papers
- Near Sample-optimal Reduction-based Policy Learning For Average Reward MDP (2022)0.00
- Span-based Optimal Sample Complexity For Average Reward Mdps (2023)0.00
- Span-based Optimal Sample Complexity For Weakly Communicating And General Average Reward Mdps (2024)0.00
- Sample Complexity Of Distributionally Robust Average-reward Reinforcement Learning (2025)0.00
- Model-free Reinforcement Learning: From Clipped Pseudo-regret To Sample Complexity (2020)0.00
- Projection By Convolution: Optimal Sample Complexity For Reinforcement Learning In Continuous-space Mdps (2024)0.00
- Reinforcement Learning In Reward-mixing Mdps (2021)0.00
- Adaptive Sampling For Best Policy Identification In Markov Decision Processes (2020)0.00