Abstract

This work presents the first finite-time analysis for the last-iterate convergence of average-reward \(Q\)-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that, under appropriate assumptions, the iterates generated by this \(Q\)-learning algorithm converge at a rate of \(\tilde\{\mathcal\{O\}\}(1/k)\) (in the mean-square sense) to the optimal \(Q\)-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to the centered optimal \(Q\)-function, also at a rate of \(\tilde\{\mathcal\{O\}\}(1/k)\). To prove these results, we show that adaptive stepsizes are necessary, as without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keychen2025from

Related papers