\(\sqrt{n}\)-regret For Learning In Markov Decision Processes With Function Approximation And Low Bellman Rank
2019 · Kefan Dong, Jian Peng, Yining Wang, et al.
Abstract
In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman ranks, we develop an online learning algorithm that learns the optimal value function while achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by OLIVE, the policy elimination algorithm proposed by Jiang et al. (2017). One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods of Dudik et al. (2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, t
Related papers
- Refined Regret For Adversarial MDPs With Linear Function Approximation (2023)
- Dynamic Regret Of Online Markov Decision Processes (2022)
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)
- Online Convex Optimization In Adversarial Markov Decision Processes (2019)
- Logarithmic Regret Of Exploration In Average Reward Markov Decision Processes (2025)
- Value-biased Maximum Likelihood Estimation For Model-based Reinforcement Learning In Discounted Linear MDPs (2023)
- Adaptive Approximate Policy Iteration (2020)
- Efficient Learning In Non-stationary Linear Markov Decision Processes (2020)