On Generalized Bellman Equations And Temporal-difference Learning
2017 · Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton
Abstract
We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the \(\lambda\)-parameters of TD, based on generalized Bellman equations. Our scheme is to set \(\lambda\) according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger \(\lambda\) values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evol
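The idea of choosing \(\lambda\) from the trace iterates can be illustrated with a minimal sketch. It assumes linear features and a trace recursion of the form \(e_t = \lambda_t \gamma \rho_{t-1} e_{t-1} + \phi(S_t)\). The specific rule below (shrink \(\lambda_t\) just enough that the carried-over part of the trace has norm at most `bound`) is an illustrative assumption, not necessarily the rule used in the paper, and the names `choose_lambda`, `update_trace`, and `bound` are hypothetical.

```python
import math


def choose_lambda(prev_trace, gamma, rho_prev, bound):
    """Return lambda in [0, 1] such that
    ||lambda * gamma * rho_prev * prev_trace|| <= bound.

    Assumed illustrative rule: use lambda = 1 whenever the carried-over
    trace is already within the bound, otherwise scale it down to the bound.
    """
    scale = gamma * rho_prev * math.sqrt(sum(x * x for x in prev_trace))
    if scale <= bound:
        return 1.0
    return bound / scale


def update_trace(prev_trace, features, gamma, rho_prev, bound):
    """One trace update: e_t = lambda_t * gamma * rho_{t-1} * e_{t-1} + phi(S_t)."""
    lam = choose_lambda(prev_trace, gamma, rho_prev, bound)
    decay = lam * gamma * rho_prev
    trace = [decay * e + f for e, f in zip(prev_trace, features)]
    return trace, lam


if __name__ == "__main__":
    # Hypothetical importance-sampling ratios; large values would blow up
    # the trace if lambda were held fixed at 1.
    trace = [0.0, 0.0]
    for rho in [5.0, 5.0, 5.0, 0.0, 5.0]:
        trace, lam = update_trace(trace, [1.0, 0.0], gamma=0.9, rho_prev=rho, bound=2.0)
        print(f"lambda={lam:.3f}  trace={trace}")
```

Under this rule the trace norm never exceeds `bound` plus the norm of the current feature vector, regardless of how large the importance-sampling ratios are, which is the kind of boundedness the abstract describes.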
Related papers
- A Finite Time Analysis Of Temporal Difference Learning With Linear Function Approximation (2018)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- Meta-learning Eligibility Traces For More Sample Efficient Temporal Difference Learning (2020)
- Preferential Temporal Difference Learning (2021)
- Finite-time Performance Of Distributed Temporal Difference Learning With Linear Function Approximation (2019)
- Discerning Temporal Difference Learning (2023)
- Meta-learning State-based Eligibility Traces For More Sample-efficient Policy Evaluation (2019)
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019)