Trajectory-aware Eligibility Traces For Off-policy Reinforcement Learning
2023 · Brett Daley, Martha White, Christopher Amato, et al.
Abstract
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditi
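The difference between per-decision and trajectory-aware trace cutting can be shown with a small sketch. For the per-decision case it assumes Retrace-style cutting, where each step multiplies the trace by c_t = λ·min(1, ρ_t). The trajectory-aware rule here, which cuts the cumulative IS product instead of each ratio, is a hypothetical illustration. It is not the operator proposed in the paper, and no convergence claim is made for it. All function names are likewise hypothetical.

```python
from typing import List, Sequence


def per_decision_traces(rhos: Sequence[float], lam: float = 1.0) -> List[float]:
    """Per-decision traces: beta_k = prod_{i<=k} lam * min(1, rho_i) (Retrace-style cutting).

    rhos[i] is the IS ratio pi(a|s) / mu(a|s) for the action taken at step i + 1.
    """
    traces, c = [], 1.0
    for rho in rhos:
        c *= lam * min(1.0, rho)  # each ratio is cut on its own; a cut is never undone
        traces.append(c)
    return traces


def trajectory_aware_traces(rhos: Sequence[float], lam: float = 1.0) -> List[float]:
    """Hypothetical trajectory-aware rule: beta_k = lam^k * min(1, prod_{i<=k} rho_i).

    The coefficient depends on the whole trajectory so far, so a low ratio early on
    can be offset by larger ratios later.
    """
    traces, prod = [], 1.0
    for k, rho in enumerate(rhos, start=1):
        prod *= rho  # cumulative (uncut) IS product over the trajectory
        traces.append(lam ** k * min(1.0, prod))
    return traces


def corrected_update(td_errors: Sequence[float], traces: Sequence[float],
                     gamma: float = 0.99) -> float:
    """Multistep correction: delta_0 + sum_{k>=1} gamma^k * beta_k * delta_k.

    td_errors has one more entry than traces, because the first TD error
    (for the action being evaluated) needs no IS correction.
    """
    total = td_errors[0]
    for k, (delta, beta) in enumerate(zip(td_errors[1:], traces), start=1):
        total += gamma ** k * beta * delta  # past TD errors re-weighted by the trace
    return total


if __name__ == "__main__":
    rhos = [0.5, 2.0, 2.0]
    print(per_decision_traces(rhos))      # [0.5, 0.5, 0.5]
    print(trajectory_aware_traces(rhos))  # [0.5, 1.0, 1.0]
```

With ratios 0.5, 2.0, 2.0, the per-decision trace stays at 0.5 after the first cut. The trajectory-aware trace climbs back to 1.0 once the later ratios offset the early one. This is the kind of reversal that a per-decision mechanism cannot express.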
Related papers
- Improving The Efficiency Of Off-policy Reinforcement Learning By Accounting For Past Decisions (2021)
- Expected Eligibility Traces (2020)
- Meta-learning State-based Eligibility Traces For More Sample-efficient Policy Evaluation (2019)
- On The Reuse Bias In Off-policy Reinforcement Learning (2022)
- Recall Traces: Backtracking Models For Efficient Reinforcement Learning (2018)
- TBQ(σ): Improving Efficiency Of Trace Utilization For Off-policy Reinforcement Learning (2019)
- Meta-learning Eligibility Traces For More Sample Efficient Temporal Difference Learning (2020)
- Adaptive Trade-offs In Off-policy Learning (2019)