Relative Entropy Pathwise Policy Optimization
2025 · Claas Voelcker, Axel Brunnbauer, Marcel Hussing, et al.
Abstract
Score-function-based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Pathwise policy gradients, i.e., derivatives computed by differentiating the objective function, alleviate these variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that
Related papers
- Proximal Policy Optimization Algorithms (2017)
- Reparameterization Proximal Policy Optimization (2025)
- Policy Optimization With Model-based Explorations (2018)
- Near-future Policy Optimization (2026)
- PTR-PPO: Proximal Policy Optimization With Prioritized Trajectory Replay (2021)
- Logistic Q-learning (2020)
- Simple Policy Optimization (2024)
- Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning (2025)