Non-uniform Noise-to-signal Ratio In The REINFORCE Policy-gradient Estimator
2026 · Haoyu Han, Heng Yang
Abstract
Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly, either in closed form or via numerical moment-evaluation algorithms, without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g. SGD an
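To make the NSR definition concrete, the following is a minimal Monte Carlo sketch, not the exact characterization the abstract describes. It assumes a hypothetical scalar linear system x_{t+1} = a x_t + b u_t with a Gaussian policy u_t = k x_t + sigma * eps_t and a quadratic cost; the function names, dynamics constants, and cost weights are all illustrative assumptions. The single-trajectory REINFORCE estimate is the summed score function times the trajectory cost, and the NSR is approximated as its sample variance divided by the squared sample mean.

```python
import numpy as np

def reinforce_samples(k, n, horizon=5, a=0.9, b=0.5, sigma=0.5, x0=1.0, seed=0):
    """Draw n single-trajectory REINFORCE gradient estimates w.r.t. the gain k.

    Assumed (illustrative) setup: x_{t+1} = a*x_t + b*u_t,
    u_t = k*x_t + sigma*eps_t, cost = sum_t (x_t^2 + u_t^2) + x_T^2.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n, x0, dtype=float)
    score = np.zeros(n)  # running sum of d/dk log pi(u_t | x_t)
    cost = np.zeros(n)   # running trajectory cost
    for _ in range(horizon):
        eps = rng.standard_normal(n)
        u = k * x + sigma * eps
        # For a Gaussian policy: d/dk log N(u; k*x, sigma^2) = (u - k*x) * x / sigma^2
        score += eps * x / sigma
        cost += x**2 + u**2
        x = a * x + b * u
    cost += x**2  # terminal cost
    return score * cost

def empirical_nsr(k, n=200_000, **kwargs):
    """Monte Carlo NSR: sample variance over squared sample mean of the estimator."""
    g = reinforce_samples(k, n, **kwargs)
    return g.var(ddof=1) / g.mean() ** 2

if __name__ == "__main__":
    for k in (-1.0, -0.5, 0.0):
        print(f"k = {k:+.1f}  empirical NSR = {empirical_nsr(k):.3g}")
```

Evaluating `empirical_nsr` over a grid of gains `k` gives a rough empirical picture of how the ratio changes across policy parameters. The estimate itself becomes unreliable where the sample-mean gradient is close to zero, since the denominator is then dominated by sampling error.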
Related papers
- On The Convergence And Sample Efficiency Of Variance-reduced Policy Gradient Method (2021) 0.00
- Gap-increasing Policy Evaluation For Efficient And Noise-tolerant Reinforcement Learning (2019) 0.00
- The Reinforce Policy Gradient Algorithm Revisited (2023) 0.00
- Stochastic Variance Reduction For Policy Gradient Estimation (2017) 0.00
- Smoothing Policies And Safe Policy Gradients (2019) 7.50
- Natural Policy Gradient For Average Reward Non-stationary RL (2025) 0.00
- On The Convergence Of Policy Gradient Methods To Nash Equilibria In General Stochastic Games (2022) 0.00
- S-REINFORCE: A Neuro-symbolic Policy Gradient Approach For Interpretable Reinforcement Learning (2023) 0.00