Sample Complexity Of Asynchronous Q-learning: Sharper Analysis And Variance Reduction

Abstract

Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a \(\gamma\)-discounted MDP with state space \(\mathcal\{S\}\) and action space \(\mathcal\{A\}\), we demonstrate that the \(\ell_\{\infty\}\)-based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise \(\epsilon\)-accurate estimate of the Q-function --- is at most on the order of \(\frac\{1\}\{\mu_\{\min\}(1-\gamma)^5\epsilon^2\}+ \frac\{t_\{mix\}\}\{\mu_\{\min\}(1-\gamma)\}\) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, \(t_\{mix\}\) and \(\mu_\{\min\}\) denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with indepen

Sample Complexity Of Asynchronous Q-learning: Sharper Analysis And Variance Reduction

Abstract

Authors

Tags

Stats

Related papers