Model-agnostic Solutions For Deep Reinforcement Learning In Non-ergodic Contexts
2026 · Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis
Abstract
Reinforcement Learning (RL) remains a central optimisation framework in machine learning. Although RL agents can converge to optimal solutions, the definition of "optimality" depends on the environment's statistical properties. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. However, when ergodicity is broken, long-term outcomes depend on the specific trajectory rather than on the ensemble average. In such settings, the ensemble average diverges from the time-average growth experienced by individual agents, with expected-value formulations yielding systematically suboptimal policies. Prior studies demonstrated that traditional RL architectures fail to recover the true optimum in non-ergodic environments. We extend this analysis to deep RL implementations and show that these, too, produce suboptimal policies under non-ergodic dynamics. Introducing explicit time dependence into the learning process can correct this limit
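The gap between ensemble average and time-average growth described in the abstract can be illustrated with a multiplicative coin-toss gamble. This is a minimal sketch and not the environment used in the paper: the multipliers `UP` and `DOWN`, the fair coin, and all function names are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical multiplicative gamble (an assumption, not from the paper):
# each round, wealth is multiplied by UP on heads and DOWN on tails.
UP, DOWN = 1.5, 0.6


def ensemble_growth_factor(up=UP, down=DOWN):
    """Expected per-round wealth multiplier, averaged over many agents."""
    return 0.5 * up + 0.5 * down


def time_average_growth_rate(up=UP, down=DOWN):
    """Expected per-round log growth along one long trajectory."""
    return 0.5 * math.log(up) + 0.5 * math.log(down)


def simulate_log_wealth(n_rounds, rng):
    """Log wealth of a single agent after n_rounds, starting from wealth 1."""
    log_wealth = 0.0
    for _ in range(n_rounds):
        log_wealth += math.log(UP if rng.random() < 0.5 else DOWN)
    return log_wealth


if __name__ == "__main__":
    rng = random.Random(0)
    n_rounds = 10_000
    print("ensemble growth factor per round:", ensemble_growth_factor())
    print("time-average log growth per round:", time_average_growth_rate())
    print("simulated log growth per round:",
          simulate_log_wealth(n_rounds, rng) / n_rounds)
```

With these assumed multipliers the expected multiplier per round exceeds 1, so an expected-value criterion favours playing, while the per-round log growth is negative, so a single agent who keeps playing loses wealth over time. An objective built on expected rewards would therefore favour a policy that is poor for any individual trajectory.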
Related papers
- Reinforcement Learning With Non-ergodic Reward Increments: Robustness Via Ergodicity Transformations (2023)
- Demystifying Reinforcement Learning In Time-varying Systems (2022)
- Moments Matter: Stabilizing Policy Optimization Using Return Distributions (2026)
- Never Worse, Mostly Better: Stable Policy Improvement In Deep Reinforcement Learning (2019)
- The Effective Horizon Explains Deep RL Performance In Stochastic Environments (2023)
- Online Reinforcement Learning In Non-stationary Context-driven Environments (2023)
- Delayed Geometric Discounts: An Alternative Criterion For Reinforcement Learning (2022)
- Transient Non-stationarity And Generalisation In Deep Reinforcement Learning (2020)