Value-biased Maximum Likelihood Estimation For Model-based Reinforcement Learning In Discounted Linear MDPs
2023 · Yu-Heng Hung, Ping-Chun Hsieh, Akshay Mete, et al.
Abstract
We consider infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature mapping. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally inefficient because they require a large number of optimization runs in each time step, especially when the state and action spaces are large. To address this issue, we propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE), a classic model-based exploration principle from the adaptive control literature for resolving the well-known closed-loop identification problem of Maximum Likelihood Estimation. We formally show that (i) VBMLE enjoys \(\widetilde{O}(d\sqrt{T})\) regret, where \(T\) is the time horizon and \(d\) is the dimension of the model parameter, and (ii) VBMLE i
Related papers
- Reinforcement Learning For Infinite-horizon Average-reward Linear MDPs Via Approximation By Discounted-reward MDPs (2024)
- Nearly Minimax Optimal Reinforcement Learning For Linear Markov Decision Processes (2022)
- Provably Efficient Reinforcement Learning For Discounted MDPs With Feature Mapping (2020)
- Regret-optimal Model-free Reinforcement Learning For Discounted MDPs With Short Burn-in Time (2023)
- Variance-aware Regret Bounds For Undiscounted Reinforcement Learning In MDPs (2018)
- Efficient Learning In Non-stationary Linear Markov Decision Processes (2020)
- Model-based Reinforcement Learning With Multinomial Logistic Function Approximation (2022)
- Logarithmic Regret Bounds For Continuous-time Average-reward Markov Decision Processes (2022)