Simplifying Deep Temporal Difference Learning
2024 · Matteo Gallici, Mattie Fellows, Benjamin Ellis, et al.
Abstract
Q-learning played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or nonlinear function approximation like deep neural networks require several additional tricks to stabilise training, primarily a large replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms sample efficiency and, similarly, the large replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify off-policy TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network or replay buffer, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises trainin
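The idea of a TD update that bootstraps from the online parameters, with no target network and no replay buffer, can be illustrated with a minimal sketch. This is not the paper's algorithm or architecture: it assumes a linear Q-function on top of layer-normalised features, and the names `layer_norm`, `q_values` and `td_update` are hypothetical. The batch stands in for transitions gathered in a single step from parallel environments.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def q_values(weights, features):
    """Q(s, a) for every action: one weight vector per action, applied to LayerNorm(features)."""
    z = layer_norm(features)
    return [sum(w_i * z_i for w_i, z_i in zip(w, z)) for w in weights]


def td_update(weights, batch, lr=0.1, gamma=0.99):
    """One semi-gradient Q-learning step averaged over a batch of transitions.

    Assumption: `batch` holds (state, action, reward, next_state, done) tuples
    from one step of several parallel environments. The bootstrap target uses
    the same `weights` being updated, so there is no target network, and the
    batch is used once and discarded, so there is no replay buffer.
    """
    grads = [[0.0] * len(w) for w in weights]
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_values(weights, next_state))
        target = reward + bootstrap
        z = layer_norm(state)
        td_error = target - sum(w_i * z_i for w_i, z_i in zip(weights[action], z))
        for i, z_i in enumerate(z):
            grads[action][i] += td_error * z_i / len(batch)
    return [[w_i + lr * g_i for w_i, g_i in zip(w, g)] for w, g in zip(weights, grads)]


# Usage: two actions, three features, one terminal transition with reward 1.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
batch = [([1.0, 2.0, 3.0], 0, 1.0, [1.0, 2.0, 3.0], True)]
new_weights = td_update(weights, batch)
```

In this toy setting the update moves the estimate for the taken action toward the reward and leaves the other action untouched. It says nothing about the convergence guarantees claimed in the abstract.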
Related papers
- Target-based Temporal Difference Learning (2019)
- An Improved Finite-time Analysis Of Temporal Difference Learning With Deep Neural Networks (2024)
- Neural Temporal-difference And Q-learning Provably Converge To Global Optima (2019)
- Discerning Temporal Difference Learning (2023)
- Backstepping Temporal Difference Learning (2023)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- Gradient Temporal-difference Learning With Regularized Corrections (2020)
- A Finite Time Analysis Of Temporal Difference Learning With Linear Function Approximation (2018)