Abstract

In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process \(M\), our algorithm has regret \(K(M) log(T) + \mathrm\{o\}(log(T))\) where \(T\) is the number of learning steps and \(K(M)\) is the best possible constant. This algorithm works by explicitly tracking the constant \(K(M)\) to learn optimally, then balances the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function \(K(M)\) is discontinuous, which is a consequence challenge for our approach. To that end, we describe a regularization mechanism to estimate \(K(M)\) with arbitrary precision from empirical data.

Authors

(none)

Tags

  • Uncategorized

Stats

Related papers