Zeroth-order Deterministic Policy Gradient
2020 · Harshat Kumar, Dionysios S. Kalogerias, George J. Pappas, et al.
Abstract
Deterministic Policy Gradient (DPG) removes a level of randomness from standard randomized-action Policy Gradient (PG), and demonstrates substantial empirical success for tackling complex dynamic problems involving Markov decision processes. At the same time, though, DPG loses its ability to learn in a model-free (i.e., actor-only) fashion, frequently necessitating the use of critics in order to obtain consistent estimates of the associated policy-reward gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients via two-point stochastic evaluations of the \(Q\)-function, constructed by properly designed low-dimensional action-space perturbations. Exploiting the idea of random horizon rollouts for obtaining unbiased estimates of the \(Q\)-function, ZDPG lifts the dependence on critics and restores true model-free policy learning, while enjoying built-in and provable algorithmic stability. Additionally, we present ne
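The two ingredients named in the abstract, random-horizon rollouts for the \(Q\)-function and two-point evaluations under action-space perturbations, can be illustrated with a minimal sketch. This is not the authors' implementation. All function names (`sample_horizon`, `rollout_q`, `two_point_gradient`) are hypothetical, and the choices of a geometric horizon with undiscounted rewards and of Gaussian perturbation directions are assumptions; the paper's exact construction may differ. The sketch only estimates the gradient of \(Q\) with respect to the action, not the full policy-reward gradient.

```python
import random


def sample_horizon(gamma, rng):
    """Draw a random horizon T with P(T >= t) = gamma**t for t = 0, 1, 2, ..."""
    t = 0
    while rng.random() < gamma:
        t += 1
    return t


def rollout_q(step, reward, policy, state, action, gamma, rng):
    """Estimate Q(state, action) from a single random-horizon rollout.

    Rewards are summed WITHOUT discounting over T + 1 steps. Because
    P(T >= t) = gamma**t, the expected sum equals the discounted return
    (assuming bounded rewards and T drawn independently of the trajectory).
    `step`, `reward`, and `policy` are hypothetical user-supplied callables.
    """
    horizon = sample_horizon(gamma, rng)
    total = 0.0
    s, a = state, action
    for _ in range(horizon + 1):
        total += reward(s, a)
        s = step(s, a)
        a = policy(s)
    return total


def two_point_gradient(q, action, mu, rng):
    """Two-point zeroth-order estimate of the gradient of q at `action`.

    Assumption: Gaussian perturbation direction u and smoothing radius mu.
    The estimate is (q(a + mu*u) - q(a - mu*u)) / (2*mu) * u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in action]
    plus = [a + mu * ui for a, ui in zip(action, u)]
    minus = [a - mu * ui for a, ui in zip(action, u)]
    scale = (q(plus) - q(minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


if __name__ == "__main__":
    rng = random.Random(0)

    # Toy stand-in for a Q-function of a 2-D action at a fixed state.
    def toy_q(a):
        return -(a[0] - 1.0) ** 2 - (a[1] + 2.0) ** 2

    n = 5000
    avg = [0.0, 0.0]
    for _ in range(n):
        g = two_point_gradient(toy_q, [0.0, 0.0], 0.1, rng)
        avg = [x + gi / n for x, gi in zip(avg, g)]
    print("averaged gradient estimate:", avg)
```

In a critic-free setting, the `q` argument of `two_point_gradient` would itself be a noisy rollout estimate such as `rollout_q` evaluated at the perturbed action, so each gradient sample needs only two rollouts and no learned critic.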
Related papers
- Zeroth-order Supervised Policy Improvement (2020)
- 3DPG: Distributed Deep Deterministic Policy Gradient Algorithms For Networked Multi-agent Systems (2022)
- Expected Policy Gradients (2017)
- Deterministic Policy Gradient For Reinforcement Learning With Continuous Time And State (2025)
- Mitigating Suboptimality Of Deterministic Policy Gradients In Complex Q-functions (2024)
- Direct Policy Gradients: Direct Optimization Of Policies In Discrete Action Spaces (2019)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)
- ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm For Sparse Reward Continuous Control (2024)