Provably Efficient And Agile Randomized Q-learning

Abstract

While Bayesian-based exploration often demonstrates superior empirical performance compared to bonus-based methods in model-based reinforcement learning (RL), its theoretical understanding remains limited for model-free settings. Existing provable algorithms either suffer from computational intractability or rely on stage-wise policy updates, which reduce responsiveness and slow down the learning process. In this paper, we propose a novel variant of the Q-learning algorithm, referred to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise policy updates for episodic tabular RL. We establish an \(\widetilde{O}(\sqrt{H^5SAT})\) regret bound, where \(S\) is the number of states, \(A\) is the number of actions, \(H\) is the episode length, and \(T\) is the total number of episodes. In addition, we present a logarithmic regret bound under a mild positive sub-optimality condition on the optimal Q-function. Empirically, RandomizedQ exhibits outstanding performanc
