A Minimal-assumption Analysis Of Q-learning With Time-varying Policies
2025 Β· Phalguni Nanda, Zaiwei Chen
Abstract
In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for \(\mathbb\{E\}[\|Q_k - Q^*\|_\infty^2]\), implying a sample complexity of order \(\mathcal\{O\}(1/\xi^2)\) for achieving \(\mathbb\{E\}[\|Q_k - Q^*\|_\infty]\le \xi\). This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for \(\mathbb\{E\}[\|Q^\{\pi_k\} - Q^*\|_\infty^2]\), where \(\pi_k\) is the learning policy at iteration \(k\), highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to a
Authors
(none)
Tags
Stats
Related papers
- Finite-time Analysis For Double Q-learning (2020)0.00
- Finite-time Analysis Of Asynchronous Q-learning Under Diminishing Step-size From Control-theoretic View (2022)3.58
- Q-learning With UCB Exploration Is Sample Efficient For Infinite-horizon MDP (2019)0.00
- Is Q-learning Minimax Optimal? A Tight Sample Complexity Analysis (2021)0.00
- Q-learning With Uniformly Bounded Variance: Large Discounting Is Not A Barrier To Fast Learning (2020)0.00
- Sample Complexity Of Asynchronous Q-learning: Sharper Analysis And Variance Reduction (2020)11.19
- From Set Convergence To Pointwise Convergence: Finite-time Guarantees For Average-reward Q-learning With Adaptive Stepsizes (2025)0.00
- Pessimistic Q-learning For Offline Reinforcement Learning: Towards Optimal Sample Complexity (2022)0.00