Revisiting Estimation Bias In Policy Gradients For Deep Reinforcement Learning
2023 Β· Haoxuan Pan, Deheng Ye, Xiaoming Duan, et al.
Abstract
We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected returns discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take into account the discount factor. Existing discussion of the influence of this bias was limited to the tabular and softmax cases in the literature. Therefore, in this paper, we extend it to the DRL setting where the policy is parameterized and demonstrate how this bias can lead to suboptimal policies theoretically. We then discuss why the empirically inaccurate implementations with shifted state distribution can still be effective. We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the
Authors
(none)
Tags
Stats
Related papers
- Parameter-free Reduction Of The Estimation Bias In Deep Reinforcement Learning For Deterministic Policy Gradients (2021)0.00
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward Mdps (2025)0.00
- Merging Deterministic Policy Gradient Estimations With Varied Bias-variance Tradeoff For Effective Deep Reinforcement Learning (2019)0.00
- Analysis Of On-policy Policy Gradient Methods Under The Distribution Mismatch (2025)0.00
- Deterministic Policy Gradient For Reinforcement Learning With Continuous Time And State (2025)0.00
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)0.00
- WD3: Taming The Estimation Bias In Deep Reinforcement Learning (2020)10.21
- An Empirical Analysis Of Measure-valued Derivatives For Policy Gradients (2021)0.00