Adaptive Sampling For Best Policy Identification In Markov Decision Processes
2020 Β· Aymen Al Marjani, Alexandre Proutiere
Abstract
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic
Authors
(none)
Tags
Stats
Related papers
- Model-free Reinforcement Learning: From Clipped Pseudo-regret To Sample Complexity (2020)0.00
- Optimal Sample Complexity For Average Reward Markov Decision Processes (2023)0.00
- Near Sample-optimal Reduction-based Policy Learning For Average Reward MDP (2022)0.00
- Instance-dependent Near-optimal Policy Identification In Linear Mdps Via Online Experiment Design (2022)0.00
- Sample-efficient Reinforcement Learning For Linearly-parameterized Mdps With A Generative Model (2021)0.00
- Robust Batch Policy Learning In Markov Decision Processes (2020)0.00
- Bayesian Learning Of Optimal Policies In Markov Decision Processes With Countably Infinite State-space (2023)0.00
- Value-biased Maximum Likelihood Estimation For Model-based Reinforcement Learning In Discounted Linear Mdps (2023)0.00