TBQ(\(\sigma\)): Improving Efficiency Of Trace Utilization For Off-policy Reinforcement Learning
2019 · Longxiang Shi, Shijian Li, Longbing Cao, et al.
Abstract
Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies in a probabilistic way, as in importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement use traces inefficiently under a greedy target policy, which makes them ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q(\(\lambda\)) and Naive Q(\(\lambda\)) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ(\(\sigma\)), which effectively unifies the tree-backup algorithm and Naive Q(\(\lambda\)). By introducing a new param
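The trace-cutting behavior contrasted in the abstract can be illustrated with a minimal sketch. It assumes the standard per-step decay of an accumulating trace, in which tree-backup scales the trace by \(\gamma\lambda\pi(a|s)\) and Naive Q(\(\lambda\)) scales it by \(\gamma\lambda\). The function names, the `greedy_flags` input, and the values of `gamma` and `lam` are hypothetical and chosen only for illustration. The sketch does not implement TBQ(\(\sigma\)) itself, since the abstract is cut off before the role of the new parameter is stated.

```python
def trace_coefficient(method, lam, action_is_greedy):
    """Per-step factor applied to an existing eligibility trace (before gamma)."""
    if method == "tree_backup":
        # Under a greedy target policy, pi(a|s) is 1 for the greedy action
        # and 0 otherwise, so one non-greedy action zeroes the trace.
        pi = 1.0 if action_is_greedy else 0.0
        return lam * pi
    if method == "naive_q":
        # Naive Q(lambda) ignores the policy mismatch and never cuts.
        return lam
    raise ValueError(f"unknown method: {method}")


def trace_of_first_pair(method, greedy_flags, gamma=0.9, lam=0.8):
    """Eligibility of the first visited state-action pair after each later step."""
    trace = 1.0
    history = [trace]
    for is_greedy in greedy_flags:
        trace *= gamma * trace_coefficient(method, lam, is_greedy)
        history.append(trace)
    return history


# Hypothetical behavior trajectory: greedy, non-greedy, greedy.
flags = [True, False, True]
tb = trace_of_first_pair("tree_backup", flags)
naive = trace_of_first_pair("naive_q", flags)
print("tree-backup:", tb)
print("naive Q(lambda):", naive)
```

In this toy trajectory the tree-backup trace drops to zero at the non-greedy step and stays there, while the naive trace keeps decaying geometrically. That gap is what the abstract describes as the inefficiency of probabilistic methods on one side and the source of convergence trouble on the other.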
Related papers
- A Unified Approach For Multi-step Temporal-difference Learning With Eligibility Traces In Reinforcement Learning (2018)
- Meta-learning State-based Eligibility Traces For More Sample-efficient Policy Evaluation (2019)
- Trajectory-aware Eligibility Traces For Off-policy Reinforcement Learning (2023)
- Improving The Efficiency Of Off-policy Reinforcement Learning By Accounting For Past Decisions (2021)
- Recall Traces: Backtracking Models For Efficient Reinforcement Learning (2018)
- Meta-learning Eligibility Traces For More Sample Efficient Temporal Difference Learning (2020)
- Multi-step Reinforcement Learning: A Unifying Algorithm (2017)
- Constrained Policy Improvement For Safe And Efficient Reinforcement Learning (2018)