Correcting Discount-factor Mismatch In On-policy Policy Gradient Methods
2023 · Fengdi Che, Gautham Vasan, A. Rupam Mahmood
Abstract
The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the *discounted stationary distribution*. However, commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution, which is technically incorrect and may even cause degenerate learning behavior in some environments. An existing solution corrects this discrepancy by using \(\gamma^t\) as a factor in the gradient estimate, but it is not widely adopted and does not work well in tasks where later states are similar to earlier states. We introduce a novel distribution correction that accounts for the discounted stationary distribution and can be plugged into many existing gradient estimators. Our correction avoids the performance degradation associated with the \(\gamma^t\) correction and has lower variance.
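To make the mismatch concrete, the following is a minimal sketch of the scalar weight that multiplies the log-likelihood gradient at each time step, with and without the existing \(\gamma^t\) factor mentioned in the abstract. It does not implement the paper's new distribution correction, which the abstract does not specify. The function names `discounted_returns` and `per_step_weights` are hypothetical, and the Monte Carlo return is assumed here as a stand-in for the action value.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each step reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def per_step_weights(rewards, gamma, use_gamma_t):
    """Scalar weight applied to grad log pi(a_t | s_t) at each step t.

    Assumption: the Monte Carlo return G_t stands in for the action value.
    """
    returns = discounted_returns(rewards, gamma)
    if use_gamma_t:
        # Existing correction: weight step t by gamma^t so the samples
        # reflect the discounted state distribution.
        return gamma ** np.arange(len(rewards)) * returns
    # Common practice: the discount in the state distribution is dropped.
    return returns


rewards = np.array([1.0, 1.0, 1.0])
common = per_step_weights(rewards, gamma=0.5, use_gamma_t=False)
corrected = per_step_weights(rewards, gamma=0.5, use_gamma_t=True)
```

With these inputs, the \(\gamma^t\) weights shrink geometrically with the time step, so later samples contribute little to the gradient. This is consistent with the abstract's remark that the \(\gamma^t\) correction works poorly when later states resemble earlier ones.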
Related papers
- Off-policy Policy Gradient With State Distribution Correction (2019)
- Analysis Of On-policy Policy Gradient Methods Under The Distribution Mismatch (2025)
- A Temporal-difference Approach To Policy Gradient Estimation (2022)
- Entropy Regularization With Discounted Future State Distribution In Policy Gradient Methods (2019)
- On The Convergence Of Discounted Policy Gradient Methods (2022)
- Revisiting Estimation Bias In Policy Gradients For Deep Reinforcement Learning (2023)
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward Mdps (2025)