Abstract

Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a \(\gamma\)-discounted MDP with state space \(\mathcal\{S\}\) and action space \(\mathcal\{A\}\), we demonstrate that the \(\ell_\{\infty\}\)-based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise \(\epsilon\)-accurate estimate of the Q-function --- is at most on the order of \(\frac\{1\}\{\mu_\{\min\}(1-\gamma)^5\epsilon^2\}+ \frac\{t_\{mix\}\}\{\mu_\{\min\}(1-\gamma)\}\) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, \(t_\{mix\}\) and \(\mu_\{\min\}\) denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with indepen

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations30
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score11.19
  • arxiv keyli2020sample

Related papers