Abstract

We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of \(\widetilde O(|S||A|t_\{\text\{mix\}\}^2 \epsilon^\{-2\})\) and a lower bound of \(Ξ©(|S||A|t_\{\text\{mix\}\} \epsilon^\{-2\})\). In these expressions, \(|S|\) and \(|A|\) denote the cardinalities of the state and action spaces respectively, \(t_\{\text\{mix\}\}\) serves as a uniform upper limit for the total variation mixing times, and \(\epsilon\) signifies the error tolerance. Therefore, a notable gap of \(t_\{\text\{mix\}\}\) still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of \(\widetilde O(|S||A|t_\{\text\{mix\}\}\epsilon^\{-2\})\). This marks the first algorithm and anal

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keywang2023optimal

Related papers