Reparameterization Proximal Policy Optimization
2025 · Hai Zhong, Xun Wang, Zhuoran Li, et al.
Abstract
By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no prior principled framework exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently outperforms
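To make the two stabilizing ingredients named in the abstract concrete, the following is a minimal sketch of the standard PPO clipped surrogate (Schulman et al., 2017) combined with a KL penalty between diagonal Gaussian policies. It is not the RPO objective: the paper's clipping mechanism tailored to RPG and its backpropagation through differentiable dynamics are not reproduced here. The function names, the `clip_eps` and `kl_coef` defaults, and the choice of a Gaussian policy are all assumptions made for illustration.

```python
import math


def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), averaged over samples.

    ratios[i] is pi_new(a_i | s_i) / pi_old(a_i | s_i); advantages[i] is the
    advantage estimate for the same sample. clip_eps is an assumed default.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two candidate terms.
        terms.append(min(r * a, r_clipped * a))
    return sum(terms) / len(terms)


def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        total += (
            math.log(s1 / s0)
            + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2)
            - 0.5
        )
    return total


def penalized_objective(ratios, advantages, kl, clip_eps=0.2, kl_coef=0.1):
    """Clipped surrogate minus an explicit KL penalty (kl_coef is hypothetical)."""
    return clipped_surrogate(ratios, advantages, clip_eps) - kl_coef * kl


if __name__ == "__main__":
    kl = gaussian_kl([0.0], [1.0], [1.0], [1.0])
    print(penalized_objective([1.5], [1.0], kl))
```

With a positive advantage, a ratio above `1 + clip_eps` contributes no more than the clipped value, so the surrogate stops rewarding further movement away from the old policy; the KL term adds a second, explicit pull toward it.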
Related papers
- Proximal Policy Optimization Algorithms (2017)
- Relative Entropy Pathwise Policy Optimization (2025)
- PTR-PPO: Proximal Policy Optimization With Prioritized Trajectory Replay (2021)
- Simple Policy Optimization (2024)
- Truly Proximal Policy Optimization (2019)
- KIPPO: Koopman-inspired Proximal Policy Optimization (2025)
- Revisiting Design Choices In Proximal Policy Optimization (2020)
- Robust And Diverse Multi-agent Learning Via Rational Policy Gradient (2025)