A Temporal-difference Approach To Policy Gradient Estimation
2022 · Samuele Tosatto, Andrew Patterson, Martha White, et al.
Abstract
The policy gradient theorem (Sutton et al., 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy. In this form, the policy gradient calculation can be simplified in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achi
Related papers
- Off-policy Policy Gradient With State Distribution Correction (2019)
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)
- Approximate Temporal Difference Learning Is A Gradient Descent For Reversible Policies (2018)
- Entropy Regularization With Discounted Future State Distribution In Policy Gradient Methods (2019)
- Analysis Of On-policy Policy Gradient Methods Under The Distribution Mismatch (2025)
- Approximate Discounting-free Policy Evaluation From Transient And Recurrent States (2022)
- On The Convergence Of Discounted Policy Gradient Methods (2022)
- Policy Gradient In Partially Observable Environments: Approximation And Convergence (2018)