Tightening Exploration In Upper Confidence Reinforcement Learning
2020 · Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
Abstract
The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical as numerical experiments in simple environments exhibit long burn-in phases before the learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Furthermore, to tighten exploration, it uses an adaptive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over dist
Related papers
- Near-optimal Optimistic Reinforcement Learning Using Empirical Bernstein Inequalities (2019)
- Non-stationary Reinforcement Learning: The Blessing Of (more) Optimism (2019)
- Minimax Regret Bounds For Reinforcement Learning (2017)
- Anti-concentrated Confidence Bonuses For Scalable Exploration (2021)
- Fundamental Limits Of Reinforcement Learning In Environment With Endogeneous And Exogeneous Uncertainty (2021)
- Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning (2021)
- Uncertainty Quantification And Exploration For Reinforcement Learning (2019)
- Conservative Exploration In Reinforcement Learning (2020)