Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy
2019 · Boyi Liu, Qi Cai, Zhuoran Yang, et al.
Abstract
Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate.
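The mirror-descent view mentioned in the abstract can be illustrated in a much simpler setting. The sketch below is not the paper's neural algorithm: it assumes a small tabular MDP with exact policy evaluation, where the KL-proximal update has the closed form pi_new(a|s) ∝ pi(a|s) · exp(eta · Q(s, a)). The function names, the random toy MDP, and the values of `eta`, `gamma`, and `iters` are all hypothetical choices made for illustration.

```python
import numpy as np

def evaluate_q(pi, P, R, gamma):
    """Exact Q^pi and V^pi for a tabular MDP; P is (S, A, S), R is (S, A)."""
    S, _ = R.shape
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)                # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ V, V

def mirror_descent_step(pi, Q, eta):
    """KL-proximal policy update: pi_new proportional to pi * exp(eta * Q)."""
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def run_demo(S=4, A=3, gamma=0.9, eta=1.0, iters=50, seed=0):
    """Run the update on a random toy MDP; return the value trace and final policy."""
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)          # normalise rows into distributions
    R = rng.random((S, A))
    pi = np.full((S, A), 1.0 / A)              # start from the uniform policy
    values = []
    for _ in range(iters):
        Q, V = evaluate_q(pi, P, R, gamma)
        values.append(V.mean())
        pi = mirror_descent_step(pi, Q, eta)
    return np.array(values), pi

if __name__ == "__main__":
    values, pi = run_demo()
    print("first and last average value:", values[0], values[-1])
```

In this exact tabular setting each update cannot decrease the value of any state, so the recorded average values form a nondecreasing sequence. The paper's setting replaces the exact Q-function and the closed-form policy with neural-network approximations, which this sketch does not attempt to model.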
Related papers
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019)
- Truly Proximal Policy Optimization (2019)
- Simple Policy Optimization (2024)
- Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization (2018)
- Proximal Policy Optimization Algorithms (2017)
- Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective (2021)
- A Novel Framework For Policy Mirror Descent With General Parameterization And Linear Convergence (2023)
- Neural Policy Gradient Methods: Global Optimality And Rates Of Convergence (2019)