Policy Search By Target Distribution Learning For Continuous Control
2019 · Chuheng Zhang, Yuanqi Li, Jian Li
Abstract
We observe that several existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic (even in some very simple environments), leading to an unstable training process. To address this issue, we propose a new method, called *target distribution learning* (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
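The large-gradient issue described in the abstract can be illustrated with a small numerical sketch. It assumes a one-dimensional Gaussian policy with mean `mean` and standard deviation `std`; this setup and the helper names `score_wrt_mean` and `mean_abs_score` are assumptions made for illustration and are not taken from the paper. For such a policy, the gradient of the log-probability with respect to the mean is (a - mean) / std^2, so for actions sampled from the policy its typical magnitude grows like 1 / std as the policy approaches a deterministic one.

```python
import random
import statistics


def score_wrt_mean(action, mean, std):
    """Gradient of log N(action; mean, std^2) with respect to the mean."""
    return (action - mean) / std ** 2


def mean_abs_score(std, n_samples=10_000, seed=0):
    """Average |score| over actions sampled from the Gaussian policy itself."""
    rng = random.Random(seed)
    mean = 0.0  # illustrative policy mean (an assumption, not from the paper)
    samples = (rng.gauss(mean, std) for _ in range(n_samples))
    return statistics.fmean(abs(score_wrt_mean(a, mean, std)) for a in samples)


if __name__ == "__main__":
    # Shrinking std moves the policy toward a deterministic one.
    for std in (1.0, 0.1, 0.01):
        print(f"std={std:<5} mean |score| = {mean_abs_score(std):.2f}")
```

Each tenfold reduction in `std` raises the average score magnitude roughly tenfold. This shows only the scaling of the log-likelihood gradient that appears in policy gradient estimators; it is not an implementation of TDL.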
Related papers
- Distributional Policy Optimization: An Alternative Approach For Continuous Control (2019)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- Proximal Policy Optimization With Continuous Bounded Action Space Via The Beta Distribution (2021)
- Policy Optimization In A Noisy Neighborhood: On Return Landscapes In Continuous Control (2023)
- Bayesian Policy Gradients Via Alpha Divergence Dropout Inference (2017)
- Deterministic Policy Gradient For Reinforcement Learning With Continuous Time And State (2025)
- Global Convergence Using Policy Gradient Methods For Model-free Markovian Jump Linear Quadratic Control (2021)
- Categorical Policies: Multimodal Policy Learning And Exploration In Continuous Control (2025)