Abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for \(\mathbb\{E\}[\|Q_k - Q^*\|_\infty^2]\), implying a sample complexity of order \(\mathcal\{O\}(1/\xi^2)\) for achieving \(\mathbb\{E\}[\|Q_k - Q^*\|_\infty]\le \xi\). This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for \(\mathbb\{E\}[\|Q^\{\pi_k\} - Q^*\|_\infty^2]\), where \(\pi_k\) is the learning policy at iteration \(k\), highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to a

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keynanda2025a

Related papers