Optimal Sample Complexity For Average Reward Markov Decision Processes

Abstract

We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of \(\widetilde O(|S||A|t_\{\text\{mix\}\}^2 \epsilon^\{-2\})\) and a lower bound of \(Ω(|S||A|t_\{\text\{mix\}\} \epsilon^\{-2\})\). In these expressions, \(|S|\) and \(|A|\) denote the cardinalities of the state and action spaces respectively, \(t_\{\text\{mix\}\}\) serves as a uniform upper limit for the total variation mixing times, and \(\epsilon\) signifies the error tolerance. Therefore, a notable gap of \(t_\{\text\{mix\}\}\) still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of \(\widetilde O(|S||A|t_\{\text\{mix\}\}\epsilon^\{-2\})\). This marks the first algorithm and anal

Optimal Sample Complexity For Average Reward Markov Decision Processes

Abstract

Authors

Tags

Stats

Related papers