Compatible Natural Gradient Policy Search
2019 · Joni Pajarinen, Hong Linh Thai, Riad Akrour, et al.
Abstract
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to bound the trust region with the KL divergence, which results in a natural gradient policy update. We show that natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
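The link between natural gradients, compatible function approximation, and a KL bound can be illustrated on a toy problem. The sketch below is not the COPOS algorithm and does not bound entropy loss. It assumes a single-state problem with a softmax policy over a few actions, known rewards, and exact expectations, and every function name in it is hypothetical. For this policy the compatible features are the score vectors, the reward vector fits the advantages exactly, and so it satisfies F w = g and serves as a natural gradient direction. The step size is then chosen by bisection so that the KL divergence from the old policy stays within a bound.

```python
import math

def softmax(theta):
    # Numerically stable softmax over logits (the policy parameters here).
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl(p, q):
    # KL(p || q); returns infinity if q assigns zero mass where p does not.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def fisher_matrix(p):
    # Fisher information of a softmax policy w.r.t. its logits: diag(p) - p p^T.
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def policy_gradient(p, rewards):
    # Exact vanilla gradient of expected reward w.r.t. the logits.
    baseline = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - baseline) for pi, ri in zip(p, rewards)]

def kl_bounded_natural_step(theta, rewards, epsilon, iters=60):
    # Move along the natural gradient direction w = rewards, with the step
    # size found by bisection so that KL(old || new) <= epsilon.
    p_old = softmax(theta)

    def kl_at(eta):
        return kl(p_old, softmax([t + eta * r for t, r in zip(theta, rewards)]))

    lo, hi = 0.0, 1.0
    while kl_at(hi) < epsilon and hi < 1e3:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_at(mid) <= epsilon:
            lo = mid
        else:
            hi = mid
    return [t + lo * r for t, r in zip(theta, rewards)]
```

Starting from uniform logits, a call such as `kl_bounded_natural_step([0.0, 0.0, 0.0], [1.0, 0.5, -0.5], 0.01)` returns new logits whose policy stays within the KL bound and has lower entropy than the uniform policy, which is the kind of entropy reduction the abstract refers to.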
Related papers
- Fast Global Convergence Of Natural Policy Gradient Methods With Entropy Regularization (2020)
- Natural Policy Gradients In Reinforcement Learning Explained (2022)
- Policy Search By Target Distribution Learning For Continuous Control (2019)
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019)
- Stochastic Variance Reduction For Policy Gradient Estimation (2017)
- Policy Gradient Using Weak Derivatives For Reinforcement Learning (2020)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- An Analytical Update Rule For General Policy Optimization (2021)