Q-learning With UCB Exploration Is Sample Efficient For Infinite-horizon MDP
2019 Β· Kefan Dong, Yuanhao Wang, Xiaoyu Chen, et al.
Abstract
A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite\{jin2018q\} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards *without* accessing a generative model. We show that the \textit\{sample complexity of exploration\} of our algorithm is bounded by \(\tilde\{O\}(\{\frac\{SA\}\{\epsilon^2(1-\gamma)^7\}\})\). This improves the previously best known result of \(\tilde\{O\}(\{\frac\{SA\}\{\epsilon^4(1-\gamma)^8\}\})\) in this setting achieved by delayed Q-learning \cite\{strehl2006pac\}, and matches the lower bound in terms of \(\epsilon\) as well as \(S\) and \(A\) except for logarithmic factors.
Authors
(none)
Tags
Stats
Related papers
- Provably Efficient And Agile Randomized Q-learning (2025)0.00
- Provably Efficient Q-learning With Low Switching Cost (2019)0.00
- Provably Efficient Exploration In Constrained Reinforcement Learning:posterior Sampling Is All You Need (2023)0.00
- Q-learning With Uniformly Bounded Variance: Large Discounting Is Not A Barrier To Fast Learning (2020)0.00
- A Model-free Learning Algorithm For Infinite-horizon Average-reward Mdps With Near-optimal Regret (2020)0.00
- Improved Bounds For Reward-agnostic And Reward-free Exploration (2026)0.00
- A Nearly Optimal And Low-switching Algorithm For Reinforcement Learning With General Function Approximation (2023)0.00
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)0.00