Optimal Estimation Of Off-policy Policy Gradient Via Double Fitted Iteration
2022 · Chengzhuo Ni, Ruiqi Zhang, Xiang Ji, et al.
Abstract
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on the policy gradient estimation error that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching Cramér-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimat
Related papers
- Stochastic Policy Gradient Methods: Improved Sample Complexity For Fisher-non-degenerate Policies (2023)
- Factored Policy Gradients: Leveraging Structure For Efficient Learning In MOMDPs (2021)
- An Alternate Policy Gradient Estimator For Softmax Policies (2021)
- Off-policy Policy Gradient With State Distribution Correction (2019)
- Efficiently Escaping Saddle Points For Policy Optimization (2023)
- PC-PG: Policy Cover Directed Exploration For Provable Policy Gradient Learning (2020)
- Smoothing Policies And Safe Policy Gradients (2019)
- PAGE-PG: A Simple And Loopless Variance-reduced Policy Gradient Method With Probabilistic Gradient Estimation (2022)