Moments Matter: Stabilizing Policy Optimization Using Return Distributions
2026 · Dennis Jabs, Aditya Mohan, Marius Lindauer
Abstract
Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods, and that the spread of the post-update return distribution \(R(\theta)\), obtained by repeatedly sampling minibatches, updating \(\theta\), and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow \(R(\theta)\) can improve stability, directly estimating \(R(\theta)\) is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental s
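The sampling procedure the abstract attributes to previous work (draw a minibatch, update \(\theta\), measure the return, repeat from the same \(\theta\)) can be illustrated on a toy problem. The sketch below is not the authors' method or code. It assumes a hypothetical one-dimensional parameter, a made-up quadratic return with Gaussian reward noise, and a synthetic noisy gradient in place of a real policy-gradient estimate. All function names and hyperparameters are illustrative.

```python
import random
import statistics


def episode_return(theta, rng):
    """Hypothetical noisy return: peak at theta = 1, Gaussian reward noise."""
    return -(theta - 1.0) ** 2 + rng.gauss(0.0, 0.1)


def minibatch_gradient(theta, rng, batch_size):
    """Synthetic noisy gradient of the expected return, averaged over a minibatch."""
    true_grad = -2.0 * (theta - 1.0)
    samples = [true_grad + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
    return sum(samples) / batch_size


def post_update_returns(theta, num_updates=200, batch_size=8, lr=0.1,
                        eval_episodes=5, seed=0):
    """Collect returns after many independent single updates from the same theta."""
    rng = random.Random(seed)
    returns = []
    for _ in range(num_updates):
        # Every iteration restarts from the same theta with a fresh minibatch.
        theta_new = theta + lr * minibatch_gradient(theta, rng, batch_size)
        # Average a few evaluation episodes to measure the post-update return.
        total = sum(episode_return(theta_new, rng) for _ in range(eval_episodes))
        returns.append(total / eval_episodes)
    return returns


samples = post_update_returns(theta=0.0)
mean_return = statistics.mean(samples)
spread = statistics.stdev(samples)
print(f"mean post-update return: {mean_return:.3f}, spread (std): {spread:.3f}")
```

Each sample here costs one update plus several evaluation episodes. With a real policy and simulator, that per-sample cost is what makes direct estimation of \(R(\theta)\) expensive.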
Related papers
- Policy Optimization In A Noisy Neighborhood: On Return Landscapes In Continuous Control (2023)
- Regularizing A Model-based Policy Stationary Distribution To Stabilize Offline Reinforcement Learning (2022)
- Never Worse, Mostly Better: Stable Policy Improvement In Deep Reinforcement Learning (2019)
- Beyond Expected Return: Accounting For Policy Reproducibility When Evaluating Reinforcement Learning Algorithms (2023)
- Model-agnostic Solutions For Deep Reinforcement Learning In Non-ergodic Contexts (2026)
- Robust Adversarial Policy Optimization Under Dynamics Uncertainty (2026)
- A Risk-sensitive Approach To Policy Optimization (2022)
- Learning To Stabilize Online Reinforcement Learning In Unbounded State Spaces (2023)