Approximate Temporal Difference Learning Is A Gradient Descent For Reversible Policies
2018 · Yann Ollivier
Abstract
In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with *nonlinear* parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent of an error between the true and approximate value functions, which would provide such guarantees. We prove that approximate TD is a gradient descent provided the current policy is *reversible*. This holds even with nonlinear approximations. A policy with transition probabilities \(P(s,s')\) between states is reversible if there exists a function \(\mu\) over states such that \(\frac{P(s,s')}{P(s',s)} = \frac{\mu(s')}{\mu(s)}\).
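The reversibility condition in the abstract can be checked numerically on a small chain. The sketch below is illustrative only and is not taken from the paper: it tests the equivalent detailed-balance form \(\mu(s)\,P(s,s') = \mu(s')\,P(s',s)\), which avoids dividing by zero transition probabilities. The helper names `random_walk_matrix` and `satisfies_detailed_balance`, the edge weights, and the tolerance are all assumptions made for the example. It relies on the standard fact that a random walk on an undirected weighted graph is reversible with \(\mu\) equal to the weighted degree, while a deterministic cycle is not.

```python
def random_walk_matrix(weights):
    """Row-normalize symmetric edge weights into transition probabilities.

    Returns the transition matrix P and the weighted degrees, which serve
    as the candidate function mu for the detailed-balance check.
    """
    degrees = [sum(row) for row in weights]
    P = [[w / d for w in row] for row, d in zip(weights, degrees)]
    return P, degrees


def satisfies_detailed_balance(P, mu, tol=1e-12):
    """Check mu[s] * P[s][t] == mu[t] * P[t][s] for every pair of states."""
    n = len(P)
    return all(
        abs(mu[s] * P[s][t] - mu[t] * P[t][s]) <= tol
        for s in range(n)
        for t in range(n)
    )


# Hypothetical symmetric edge weights on a 3-state undirected graph.
weights = [[0.0, 2.0, 1.0],
           [2.0, 0.0, 3.0],
           [1.0, 3.0, 0.0]]
P_walk, mu_walk = random_walk_matrix(weights)
walk_ok = satisfies_detailed_balance(P_walk, mu_walk)

# Deterministic cycle 0 -> 1 -> 2 -> 0: P(0,1) = 1 but P(1,0) = 0,
# so no strictly positive mu can satisfy the condition.
P_cycle = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0]]
cycle_ok = satisfies_detailed_balance(P_cycle, [1.0, 1.0, 1.0])

print("random walk reversible:", walk_ok)
print("cycle reversible (uniform mu):", cycle_ok)
```

Note that the check needs an explicit candidate \(\mu\); for the cycle, the uniform choice is shown only as one failing candidate.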
Related papers
- Preferential Temporal Difference Learning (2021)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- Discerning Temporal Difference Learning (2023)
- Backstepping Temporal Difference Learning (2023)
- A Finite Time Analysis Of Temporal Difference Learning With Linear Function Approximation (2018)
- O\(^2\)TD: (near)-optimal Off-policy TD Learning (2017)
- Differential Temporal Difference Learning (2018)
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019)