A Minimal-assumption Analysis Of Q-learning With Time-varying Policies

Abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for \(\mathbb\{E\}[\|Q_k - Q^*\|_\infty^2]\), implying a sample complexity of order \(\mathcal\{O\}(1/\xi^2)\) for achieving \(\mathbb\{E\}[\|Q_k - Q^*\|_\infty]\le \xi\). This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for \(\mathbb\{E\}[\|Q^\{\pi_k\} - Q^*\|_\infty^2]\), where \(\pi_k\) is the learning policy at iteration \(k\), highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to a

A Minimal-assumption Analysis Of Q-learning With Time-varying Policies

Abstract

Authors

Tags

Stats

Related papers