Improving The Efficiency Of Off-policy Reinforcement Learning By Accounting For Past Decisions
2021 · Brett Daley, Christopher Amato
Abstract
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action. Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating ("cutting") the ratios ("traces") to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning. In the interest of motivating efficient off-policy algorithms, we propose a multistep
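The per-decision mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook forms of the trace coefficients: the raw IS ratio π/μ, the Retrace clip min(1, π/μ), and the Tree Backup coefficient π, each scaled by a decay parameter λ. The function name `cumulative_traces` and the probabilities used are hypothetical and chosen only for illustration; this is not the multistep operator proposed in the paper.

```python
from itertools import accumulate
import operator


def cumulative_traces(pi_probs, mu_probs, lam=1.0, method="retrace"):
    """Return the running product of per-decision trace coefficients.

    pi_probs / mu_probs: probabilities that the target policy (pi) and the
    behavior policy (mu) assign to each action that was actually taken.
    The k-th entry of the result is the weight applied to a TD error
    observed after the first k of these decisions.
    """
    coeffs = []
    for pi, mu in zip(pi_probs, mu_probs):
        if method == "is":             # unclipped importance sampling ratio
            c = lam * pi / mu
        elif method == "retrace":      # ratio clipped at 1 after each action
            c = lam * min(1.0, pi / mu)
        elif method == "tree_backup":  # target-policy probability only
            c = lam * pi
        else:
            raise ValueError(f"unknown method: {method}")
        coeffs.append(c)
    return list(accumulate(coeffs, operator.mul))


# Hypothetical two-step trajectory: the first action is unlikely under pi
# (ratio about 0.25), the second is much more likely (ratio about 3).
pi_probs = [0.2, 0.9]
mu_probs = [0.8, 0.3]

for method in ("is", "retrace", "tree_backup"):
    print(method, cumulative_traces(pi_probs, mu_probs, method=method))
```

In this example the unclipped IS product recovers to roughly 0.75 after the second action, whereas the Retrace product stays at roughly 0.25: the second ratio is clipped to 1, so the earlier cut is never offset. This is the irreversibility of per-decision cutting that the abstract points to.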
Related papers
- Trajectory-aware Eligibility Traces For Off-policy Reinforcement Learning (2023) 0.00
- On The Reuse Bias In Off-policy Reinforcement Learning (2022) 3.58
- Handling Cost And Constraints With Off-policy Deep Reinforcement Learning (2023) 0.00
- Learning Expected Emphatic Traces For Deep RL (2021) 5.24
- Regret Minimization Experience Replay In Off-policy Reinforcement Learning (2021) 0.00
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025) 0.00
- Mitigating Off-policy Bias In Actor-critic Methods With One-step Q-learning: A Novel Correction Approach (2022) 0.00
- CUER: Corrected Uniform Experience Replay For Off-policy Continuous Deep Reinforcement Learning Algorithms (2024) 0.00