On The Convergence Of Experience Replay In Policy Optimization: Characterizing Bias, Variance, And Finite-time Convergence
2021 · Hua Zheng, Wei Xie, M. Ben Feng
Abstract
Experience replay is a core ingredient of modern deep reinforcement learning, yet its benefits in policy optimization are poorly understood beyond empirical heuristics. This paper develops a novel theoretical framework for experience replay in modern policy gradient methods, where two sources of dependence fundamentally complicate analysis: Markovian correlations along trajectories and policy drift across optimization iterations. We introduce a new proof technique based on auxiliary Markov chains and lag-based decoupling that makes these dependencies tractable. Within this framework, we derive finite-time bias bounds for policy-gradient estimators under replay, identifying how bias scales with the cumulative policy update, the mixing time of the underlying dynamics, and the age of buffered data, thereby formalizing the practitioner's rule of avoiding overly stale replay. We further provide a correlation-aware variance decomposition showing how sample dependence governs gradient varianc
Related papers
- Replay For Safety (2021)
- CUER: Corrected Uniform Experience Replay For Off-policy Continuous Deep Reinforcement Learning Algorithms (2024)
- Adaptive Experience Selection For Policy Gradient (2020)
- Off-policy Correction For Deep Deterministic Policy Gradient Algorithms Via Batch Prioritized Experience Replay (2021)
- Stratified Experience Replay: Correcting Multiplicity Bias In Off-policy Reinforcement Learning (2021)
- Safe And Robust Experience Sharing For Deterministic Policy Gradient Algorithms (2022)
- Variance Reduction Based Partial Trajectory Reuse To Accelerate Policy Gradient Optimization (2022)
- Remember And Forget For Experience Replay (2018)