Learning Expected Emphatic Traces For Deep RL
2021 · Ray Jiang, Shangtong Zhang, Veronica Chelu, et al.
Abstract
Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, this combination is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed \(n\)-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested
Related papers
- Off-policy Reinforcement Learning With Loss Function Weighted By Temporal Difference Error (2022)
- Truncated Emphatic Temporal Difference Methods For Prediction And Control (2021)
- Stratified Experience Replay: Correcting Multiplicity Bias In Off-policy Reinforcement Learning (2021)
- Discerning Temporal Difference Learning (2023)
- Improving The Efficiency Of Off-policy Reinforcement Learning By Accounting For Past Decisions (2021)
- Finite-time Analysis Of Temporal Difference Learning With Experience Replay (2023)
- Simplifying Deep Temporal Difference Learning (2024)
- Deep Reinforcement Learning And The Deadly Triad (2018)