Abstract

We present an algorithm based on posterior sampling (also known as Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of \(\tilde{O}(DS\sqrt{AT})\) for any communicating MDP with \(S\) states, \(A\) actions and diameter \(D\). Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, over a time horizon \(T\). This result closely matches the known lower bound of \(\Omega(\sqrt{DSAT})\). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
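
To make the regret notion above concrete, one common way to write it is sketched below. The notation is an assumption and does not appear in the abstract: \(\rho^*\) denotes the optimal average reward (gain) of the MDP, which does not depend on the starting state when the MDP is communicating, and \(r_t\) denotes the reward the algorithm collects at step \(t\).

\[
\mathcal{R}(T) \;=\; T\rho^* \;-\; \sum_{t=1}^{T} r_t
\]

Under this reading, the main result says that \(\mathcal{R}(T) = \tilde{O}(DS\sqrt{AT})\) holds with high probability.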
