Minimax Regret Bounds For Reinforcement Learning
2017 Β· Mohammad Gheshlaghi Azar, Ian Osband, RΓ©mi Munos
Abstract
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of \(\tilde\{O\}( \sqrt\{HSAT\} + H^2S^2A+H\sqrt\{T\})\) where \(H\) is the time horizon, \(S\) the number of states, \(A\) the number of actions and \(T\) the number of time-steps. This result improves over the best previous known bound \(\tilde\{O\}(HS \sqrt\{AT\})\) achieved by the UCRL2 algorithm of Jaksch et al., 2010. The key significance of our new results is that when \(T\geq H^3S^3A\) and \(SA\geq H\), it leads to a regret of \(\tilde\{O\}(\sqrt\{HSAT\})\) that matches the established lower bound of \(Ξ©(\sqrt\{HSAT\})\) up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions probabilities (to improve scaling in \(S\)), and we define Bernstein-based "e
Authors
(none)
Tags
Stats
Related papers
- Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function (2019)0.00
- Provably Efficient Exploration In Constrained Reinforcement Learning:posterior Sampling Is All You Need (2023)0.00
- Near-optimal Optimistic Reinforcement Learning Using Empirical Bernstein Inequalities (2019)0.00
- Breaking The Sample Complexity Barrier To Regret-optimal Model-free Reinforcement Learning (2021)0.00
- Optimistic Posterior Sampling For Reinforcement Learning With Few Samples And Tight Guarantees (2022)0.00
- Posterior Sampling For Reinforcement Learning: Worst-case Regret Bounds (2017)0.00
- Regret Bounds For Reinforcement Learning Via Markov Chain Concentration (2018)9.03
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)0.00