Preferential Temporal Difference Learning
2021 · Nishanth Anand, Doina Precup
Abstract
Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, when computing updates, it may be useful to take into account information other than whether a state is visited. For example, some states might be more important than others (such as states that are frequently seen in a successful trajectory). Or some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function app
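The abstract does not state the update rule, so the following is only a minimal tabular sketch of one way such re-weighting could be realized, not the authors' algorithm. It assumes a hypothetical per-state preference `beta[s]` in [0, 1] and an eligibility-style trace: a state with a high preference is updated strongly and cuts the trace (its value serves as the target), while a state with a low preference is barely updated and lets later TD-errors pass through it to earlier states. The function name, the `beta` table, and the episode format are all assumptions made for illustration.

```python
def weighted_td_episode(values, episode, beta, alpha=0.5, gamma=1.0):
    """Run one episode of a preference-weighted TD update (illustrative sketch).

    values  : list of state-value estimates, indexed by state id
    episode : list of (state, reward, next_state, done) transitions
    beta    : hypothetical per-state preference in [0, 1] (an assumption,
              not taken from the source)
    """
    values = list(values)          # work on a copy
    trace = [0.0] * len(values)    # per-state credit for upcoming TD-errors

    for state, reward, next_state, done in episode:
        # A fully preferred state (beta = 1) resets the trace, so earlier
        # states bootstrap from it; beta = 0 leaves the trace untouched.
        decay = gamma * (1.0 - beta[state])
        trace = [decay * e for e in trace]
        # The visited state receives credit in proportion to its preference.
        trace[state] += beta[state]

        bootstrap = 0.0 if done else gamma * values[next_state]
        td_error = reward + bootstrap - values[state]

        values = [v + alpha * td_error * e for v, e in zip(values, trace)]

    return values


# Two-step chain: state 0 -> state 1 -> terminal state 2, reward 1 at the end.
chain = [(0, 0.0, 1, False), (1, 1.0, 2, True)]

# With beta = 1 everywhere the sketch reduces to ordinary TD(0).
plain = weighted_td_episode([0.0, 0.0, 0.0], chain, beta=[1.0, 1.0, 1.0])

# With beta[1] = 0, state 1 is neither updated nor used as a target,
# so the final reward is credited directly to state 0.
skipped = weighted_td_episode([0.0, 0.0, 0.0], chain, beta=[1.0, 0.0, 1.0])
```

In this sketch, setting every preference to 1 recovers the usual visit-based update, which makes it easy to compare against plain TD(0) on the same trajectory.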
Related papers
- Discerning Temporal Difference Learning (2023)
- Approximate Temporal Difference Learning Is A Gradient Descent For Reversible Policies (2018)
- On The Statistical Benefits Of Temporal Difference Learning (2023)
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- Differential Temporal Difference Learning (2018)
- Gradient Iterated Temporal-difference Learning (2026)
- Meta-learning Eligibility Traces For More Sample Efficient Temporal Difference Learning (2020)