The Reinforce Policy Gradient Algorithm Revisited
2023 · Shalabh Bhatnagar
Abstract
We revisit the Reinforce policy gradient algorithm from the literature. This algorithm typically works with cost returns collected over random-length episodes, which arise either from termination upon reaching a goal state (as with episodic tasks) or from the instants of visit to a prescribed recurrent state (in the case of continuing tasks). We propose a major enhancement to the basic algorithm: we estimate the policy gradient using a function measurement over a perturbed parameter by appealing to a class of random search approaches. This has advantages in the case of systems with infinite state and action spaces, as it relaxes some of the regularity requirements that would otherwise be needed for proving convergence of the Reinforce algorithm. We observe that even though we estimate the gradient of the performance objective using the performance objective itself (and not via the sample gradient), the algorithm converges to a neighborhood of a local minimum. We also provide
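To make the idea of estimating a gradient from a single function measurement at a perturbed parameter concrete, here is a minimal sketch. It assumes a Gaussian smoothed-functional estimator, which is one member of the random search family; the abstract does not say which variant the paper uses. The names `episode_cost`, `sf_gradient_estimate`, and `averaged_estimate` are hypothetical, and the quadratic cost is only a toy stand-in for an episodic cost return.

```python
import numpy as np


def episode_cost(theta):
    """Hypothetical stand-in for the cost return of one episode run with
    policy parameter theta. A real implementation would simulate an episode;
    this toy version uses a quadratic whose true gradient is 2 * theta."""
    theta = np.asarray(theta, dtype=float)
    return float(theta @ theta)


def sf_gradient_estimate(cost_fn, theta, delta, rng):
    """One-measurement gradient estimate (assumed Gaussian smoothed-functional
    form): perturb theta by delta * Delta with Delta ~ N(0, I), measure the
    cost once, and scale the perturbation by cost / delta."""
    theta = np.asarray(theta, dtype=float)
    perturbation = rng.standard_normal(theta.shape)
    cost = cost_fn(theta + delta * perturbation)
    return (perturbation / delta) * cost


def averaged_estimate(cost_fn, theta, delta, num_samples, seed=0):
    """Average many independent one-measurement estimates. A single estimate
    is very noisy; the average approaches the gradient of the smoothed cost."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(np.asarray(theta, dtype=float))
    for _ in range(num_samples):
        total += sf_gradient_estimate(cost_fn, theta, delta, rng)
    return total / num_samples
```

Note that the estimator never differentiates the cost or the policy: it only needs cost values. For this toy quadratic, calling `averaged_estimate(episode_cost, [1.0, -1.0], 0.5, 50000)` should give a vector close to the true gradient `[2.0, -2.0]`, though any single estimate can be far off.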
Related papers
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Policy Gradient For Continuing Tasks In Non-stationary Markov Decision Processes (2020)
- Policy Gradient Using Weak Derivatives For Reinforcement Learning (2020)
- Stabilizing Policy Gradient Methods Via Reward Profiling (2025)
- All-action Policy Gradient Methods: A Numerical Integration Approach (2019)
- Model-free Policy Learning With Reward Gradients (2021)
- Residual Policy Gradient: A Reward View Of KL-regularized Objective (2025)
- Policy Gradient Algorithms Implicitly Optimize By Continuation (2023)