Abstract

We present an algorithm based on the *Optimism in the Face of Uncertainty* (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function \(h^\{*\}\), the proposed algorithm achieves a regret bound of \(\tilde\{O\}(\sqrt\{SAHT\})\)\footnote\{The symbol \(\tilde\{O\}\) means \(O\) with log factors ignored. \} for MDP with \(S\) states and \(A\) actions, in the case that an upper bound \(H\) on the span of \(h^\{*\}\), i.e., \(sp(h^\{*\})\) is known. This result outperforms the best previous regret bounds \(\tilde\{O\}(S\sqrt\{AHT\}) \)\citep\{fruit2019improved\} by a factor of \(\sqrt\{S\}\). Furthermore, this regret bound matches the lower bound of \(Ξ©(\sqrt\{SAHT\}) \)\citep\{jaksch2010near\} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of \(\tilde\{O\}(\sqrt\{SADT\})\) for MDPs

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyzhang2019regret

Related papers