Policy Gradient Using Weak Derivatives For Reinforcement Learning
2020 · Sujay Bhatt, Alec Koppel, Vikram Krishnamurthy
Abstract
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem is established that uses weak (measure-valued) derivatives instead of the score function; (ii) the resulting stochastic gradient estimates are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and shown to be \(O(1/\sqrt{k})\); (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that of estimates obtained using the popular score-function approach. Experiments on
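To make the contrast between the two estimators concrete, the following is a minimal sketch on a toy problem, not the paper's algorithm: it estimates the derivative of E[f(X)] with respect to the mean of a one-dimensional Gaussian, once with the score function and once with a weak (measure-valued) derivative. The function name `gradient_samples`, the test function f(x) = x^2, and all parameter values are assumptions chosen for illustration. The sketch relies on the standard decomposition of the Gaussian's derivative in its mean as a scaled difference of two measures, which can be sampled as mu + sigma*W and mu - sigma*W with W Rayleigh-distributed.

```python
import numpy as np


def gradient_samples(mu, sigma, n, seed=0, f=np.square):
    """Per-sample estimates of d/dmu E[f(X)], X ~ N(mu, sigma^2).

    Returns (score_samples, weak_samples). Hypothetical helper for
    illustration only; it is not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Score-function estimator: f(X) * d/dmu log N(X; mu, sigma^2).
    x = rng.normal(mu, sigma, size=n)
    score = f(x) * (x - mu) / sigma**2

    # Weak-derivative estimator: the derivative of the Gaussian density
    # in mu equals c * (p_plus - p_minus), where p_plus and p_minus are
    # the laws of mu + sigma*W and mu - sigma*W, W ~ Rayleigh(1).
    # Using the same W for both terms couples the two samples.
    w = rng.rayleigh(scale=1.0, size=n)
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    weak = c * (f(mu + sigma * w) - f(mu - sigma * w))

    return score, weak


if __name__ == "__main__":
    score, weak = gradient_samples(mu=1.0, sigma=1.0, n=200_000)
    # For f(x) = x^2 the exact derivative is 2 * mu.
    print("score-function: mean", score.mean(), "variance", score.var())
    print("weak derivative: mean", weak.mean(), "variance", weak.var())
```

Both sample means should sit close to the exact value 2 * mu, while the weak-derivative samples show a much smaller variance for this particular f. That is only a single-distribution illustration of the kind of effect claimed in result (iv); it does not reproduce the paper's policy-gradient setting or its proof.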
Related papers
- An Analysis Of Measure-valued Derivatives For Policy Gradients (2022)
- An Empirical Analysis Of Measure-valued Derivatives For Policy Gradients (2021)
- Deterministic Policy Gradient For Reinforcement Learning With Continuous Time And State (2025)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- Variational Policy Gradient Method For Reinforcement Learning With General Utilities (2020)
- Convergence And Optimality Of Policy Gradient Methods In Weakly Smooth Settings (2021)
- The Reinforce Policy Gradient Algorithm Revisited (2023)
- Policy Gradient Algorithms Implicitly Optimize By Continuation (2023)