On Proximal Policy Optimization's Heavy-tailed Gradients
2021 · Saurabh Garg, Joshua Zhanson, Emilio Parisotto, et al.
Abstract
Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clip
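To make the quantities named in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes the standard per-sample PPO clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), with a hypothetical default ε = 0.2, and it uses sample excess kurtosis as one common proxy for heavy-tailedness; the function names and the choice of estimator are illustrative assumptions.

```python
import numpy as np


def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (eps=0.2 is an assumed default).

    ratio:     likelihood ratio pi_new(a|s) / pi_old(a|s)
    advantage: advantage estimate for the same (s, a) pair
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Take the pessimistic (smaller) of the unclipped and clipped objectives.
    return np.minimum(ratio * advantage, clipped * advantage)


def excess_kurtosis(x):
    """Sample excess kurtosis: about 0 for Gaussian data, large for heavy tails.

    This is an illustrative heavy-tailedness proxy, assumed here for the sketch.
    """
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0


# Synthetic stand-ins for gradient norms: light-tailed vs. heavy-tailed samples.
rng = np.random.default_rng(0)
gaussian_samples = rng.normal(size=100_000)
heavy_samples = rng.standard_t(df=3, size=100_000)  # Student-t, 3 degrees of freedom

print("Gaussian excess kurtosis:", excess_kurtosis(gaussian_samples))
print("Student-t excess kurtosis:", excess_kurtosis(heavy_samples))
```

In practice one would apply such an estimator to per-sample gradient norms collected during training rather than to synthetic draws; the synthetic data here only illustrates how the statistic separates light-tailed from heavy-tailed samples.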
Related papers
- Proximal Policy Optimization Algorithms (2017)
- Gradient Informed Proximal Policy Optimization (2023)
- Revisiting Design Choices In Proximal Policy Optimization (2020)
- Truly Proximal Policy Optimization (2019)
- Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective (2021)
- KIPPO: Koopman-inspired Proximal Policy Optimization (2025)
- The Sufficiency Of Off-policyness And Soft Clipping: PPO Is Still Insufficient According To An Off-policy Measure (2022)
- A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes (2023)