You May Not Need Ratio Clipping In PPO
2022 · Mingfei Sun, Vitaly Kurin, Guoqing Liu, et al.
Abstract
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data. Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. Ratio clipping yields a pessimistic estimate of the original surrogate objective, and has been shown to be crucial for strong performance. We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios. Instead, one can directly optimize the original surrogate objective for multiple epochs; the key is to find a proper condition to early stop the optimization epoch in each iteration. Our theoretical analysis sheds light on how to determine when to stop the optimization epoch, and we call the resulting algorithm Early Stopping Policy Optimization (ESPO). We compare ESPO with PPO across many continuous control tasks and sho
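To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the two surrogate objectives and of an early-stopping check. The clipped form follows the standard PPO definition. The stopping rule is an assumption for illustration only: the abstract does not state the exact condition, so `should_stop`, its mean absolute ratio deviation test, and the threshold `delta` are hypothetical, as are the function names and the default `eps`.

```python
import numpy as np


def unclipped_surrogate(ratio, adv):
    """Original surrogate objective: mean of ratio * advantage."""
    return float(np.mean(ratio * adv))


def clipped_surrogate(ratio, adv, eps=0.2):
    """Ratio-clipping PPO objective.

    Takes the element-wise minimum of the unclipped term and the term
    computed with the ratio clipped to [1 - eps, 1 + eps], which makes it
    a pessimistic (lower) estimate of the original surrogate.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))


def should_stop(ratio, delta=0.25):
    """Hypothetical early-stopping check (an assumption, not from the abstract).

    Signals that the optimization epochs for this iteration should end once
    the mean absolute deviation of the ratios from 1 exceeds delta.
    """
    return float(np.mean(np.abs(ratio - 1.0))) > delta


# Toy batch: probability ratios and advantage estimates.
ratio = np.array([0.5, 1.0, 2.0])
adv = np.array([1.0, -1.0, 1.0])

print(unclipped_surrogate(ratio, adv))
print(clipped_surrogate(ratio, adv))
print(should_stop(ratio))
```

In a training loop under these assumptions, one would take gradient steps on `unclipped_surrogate` over mini-batches and call `should_stop` on the current ratios after each step, breaking out of the epoch loop when it returns `True`.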
Related papers
- Truly Proximal Policy Optimization (2019)
- The Sufficiency Of Off-policyness And Soft Clipping: PPO Is Still Insufficient According To An Off-policy Measure (2022)
- Simple Policy Optimization (2024)
- CIM-PPO: Proximal Policy Optimization With Liu-correntropy Induced Metric (2021)
- Proximal Policy Optimization With Relative Pearson Divergence (2020)
- PPO In The Fisher-Rao Geometry (2025)
- Revisiting Design Choices In Proximal Policy Optimization (2020)
- Proximal Policy Optimization Via Enhanced Exploration Efficiency (2020)