Reducing Variance In Temporal-difference Value Estimation Via Ensemble Of Deep Networks
2022 · Litian Liang, Yaosheng Xu, Stephen McAleer, et al.
Abstract
In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods; however, none has achieved sample-efficient learning by addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 reduces estimation variance enough to obviate the lagging target network, eliminating it as a source of bias and further improving sample efficiency. We justify, intuitively and empirically, the design choices in MeanQ, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environmen
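To make the idea of "target values as ensemble means" concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the authors' implementation: the function name `ensemble_mean_target`, the list-of-lists layout for the Q-values, and the toy numbers are all hypothetical, and the abstract does not specify how the networks are trained or sampled. The sketch averages the next-state Q-values of K ensemble members action by action, then bootstraps from the greedy action under that average.

```python
from statistics import mean


def ensemble_mean_target(reward, next_q_values, gamma=0.99, done=False):
    """TD target built from the ensemble mean of next-state Q-values.

    next_q_values: list of K lists (hypothetical layout), one per ensemble
    member, each holding that member's Q estimate for every action in the
    next state.
    """
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Average the K members' estimates action by action.
    mean_q = [mean(per_action) for per_action in zip(*next_q_values)]
    # Bootstrap from the greedy action under the averaged estimate.
    return reward + gamma * max(mean_q)


# Toy example: 3 ensemble members, 2 actions (made-up numbers).
next_q = [
    [1.0, 2.0],
    [2.0, 1.0],
    [1.5, 1.5],
]

# Max of the ensemble mean, as in the sketch above.
target = ensemble_mean_target(reward=0.0, next_q_values=next_q, gamma=0.5)

# For comparison: average of each member's own max, which is never smaller.
mean_of_max = mean(max(q) for q in next_q)
```

In this toy case the members disagree about which action is best, so each member's own maximum (2.0, 2.0, 1.5) is inflated by its noise, while the averaged estimate is 1.5 for both actions. Taking the maximum after averaging can never exceed averaging the per-member maxima, which is one intuitive reading of why lower-variance estimates lead to less overestimation.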
Related papers
- An Information-theoretic Optimality Principle For Deep Reinforcement Learning (2017)
- On The Estimation Bias In Double Q-learning (2021)
- Parameter-free Reduction Of The Estimation Bias In Deep Reinforcement Learning For Deterministic Policy Gradients (2021)
- The Statistical Benefits Of Quantile Temporal-difference Learning For Value Estimation (2023)
- WD3: Taming The Estimation Bias In Deep Reinforcement Learning (2020)
- Aggressive Q-learning With Ensembles: Achieving Both High Sample Efficiency And High Asymptotic Performance (2021)
- Target-based Temporal Difference Learning (2019)
- On The Reduction Of Variance And Overestimation Of Deep Q-learning (2019)