Neural Temporal-difference And Q-learning Provably Converge To Global Optima
2019 · Qi Cai, Zhuoran Yang, Jason D. Lee, et al.
Abstract
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
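To make the object of study concrete, the following is a minimal sketch of semi-gradient TD(0) for policy evaluation with a wide two-layer ReLU network. It is an illustration under stated assumptions, not the paper's exact algorithm: the two-state chain, the one-hot features, the width, the step size, and all function names (`init_network`, `value`, `td_step`, `run_demo`) are hypothetical choices made for this example. It also omits any projection or iterate-averaging step that a formal analysis might require. Only the hidden-layer weights are trained; the output signs stay fixed at their random initial values.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def init_network(width, dim, rng):
    """Two-layer ReLU network: Gaussian hidden weights, fixed random output signs."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
         for _ in range(width)]
    b = [rng.choice((-1.0, 1.0)) for _ in range(width)]
    return W, b


def value(W, b, x):
    """f(x; W) = (1 / sqrt(m)) * sum_r b_r * relu(<W_r, x>), with m = len(W)."""
    total = 0.0
    for W_r, b_r in zip(W, b):
        total += b_r * relu(sum(w * xi for w, xi in zip(W_r, x)))
    return total / math.sqrt(len(W))


def td_step(W, b, x, reward, x_next, gamma, lr):
    """One semi-gradient TD(0) update on the hidden weights W (in place)."""
    # TD error; the bootstrapped target is treated as a constant (no gradient).
    delta = value(W, b, x) - (reward + gamma * value(W, b, x_next))
    scale = lr * delta / math.sqrt(len(W))
    for W_r, b_r in zip(W, b):
        # The gradient of f with respect to W_r is b_r * x / sqrt(m)
        # when the unit is active, and zero otherwise.
        if sum(w * xi for w, xi in zip(W_r, x)) > 0.0:
            for i, xi in enumerate(x):
                W_r[i] -= scale * b_r * xi
    return delta


def run_demo(width=64, gamma=0.5, lr=0.2, steps=5000, seed=0):
    """Evaluate a toy deterministic chain 0 -> 1 -> 0 -> ... (an assumption)."""
    rng = random.Random(seed)
    features = ([1.0, 0.0], [0.0, 1.0])  # one-hot state features
    rewards = (1.0, 0.0)                 # reward collected when leaving each state
    W, b = init_network(width, 2, rng)
    state = 0
    for _ in range(steps):
        next_state = 1 - state
        td_step(W, b, features[state], rewards[state],
                features[next_state], gamma, lr)
        state = next_state
    return value(W, b, features[0]), value(W, b, features[1])


if __name__ == "__main__":
    v0, v1 = run_demo()
    print("estimated values:", v0, v1)
```

For this toy chain the exact values solve V(0) = 1 + γ V(1) and V(1) = γ V(0), which gives V(0) = 4/3 and V(1) = 2/3 at γ = 0.5, so the learned estimates can be compared against a known answer. The update is not the gradient of any fixed objective, because the target moves with the weights, and this is the source of the nonconvexity and potential divergence that the abstract refers to.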
Related papers
- Simplifying Deep Temporal Difference Learning (2024)
- On The Performance Of Temporal Difference Learning With Neural Networks (2023)
- An Experimental Comparison Between Temporal Difference And Residual Gradient With Neural Network Approximation (2022)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- TD Convergence: An Optimization Perspective (2023)
- Geometric Insights Into The Convergence Of Nonlinear TD Learning (2019)
- Target-based Temporal Difference Learning (2019)
- An Improved Finite-time Analysis Of Temporal Difference Learning With Deep Neural Networks (2024)