Abstract

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of \(\tilde\{O\}( \sqrt\{HSAT\} + H^2S^2A+H\sqrt\{T\})\) where \(H\) is the time horizon, \(S\) the number of states, \(A\) the number of actions and \(T\) the number of time-steps. This result improves over the best previous known bound \(\tilde\{O\}(HS \sqrt\{AT\})\) achieved by the UCRL2 algorithm of Jaksch et al., 2010. The key significance of our new results is that when \(T\geq H^3S^3A\) and \(SA\geq H\), it leads to a regret of \(\tilde\{O\}(\sqrt\{HSAT\})\) that matches the established lower bound of \(Ξ©(\sqrt\{HSAT\})\) up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions probabilities (to improve scaling in \(S\)), and we define Bernstein-based "e

Authors

(none)

Tags

  • Exploration

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyazar2017minimax

Related papers