Softmax Deep Double Deterministic Policy Gradients
2020 · Ling Pan, Qingpeng Cai, Longbo Huang
Abstract
A widely used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can degrade performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can introduce a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms: it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators,
Related papers
- Mitigating Estimation Bias With Representation Learning In TD Error-driven Regularization (2025)
- Value Activation For Bias Alleviation: Generalized-activated Deep Double Deterministic Policy Gradients (2021)
- An Alternate Policy Gradient Estimator For Softmax Policies (2021)
- Soft Policy Gradient Method For Maximum Entropy Deep Reinforcement Learning (2019)
- Neural Replicator Dynamics (2019)
- Double Actor-critic With TD Error-driven Regularization In Reinforcement Learning (2024)
- Mitigating Suboptimality Of Deterministic Policy Gradients In Complex Q-functions (2024)
- Improved Exploration Through Latent Trajectory Optimization In Deep Deterministic Policy Gradient (2019)