Abstract

In reinforcement learning, temporal difference (TD) learning is the most direct algorithm for learning the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with *nonlinear* parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent on an error between the true and approximate value functions, which would provide such guarantees. We prove that approximate TD is a gradient descent provided the current policy is *reversible*. This holds even with nonlinear approximations. A policy with transition probabilities \(P(s,s')\) between states is reversible if there exists a function \(\mu\) over states such that \(\frac{P(s,s')}{P(s',s)} = \frac{\mu(s')}{\mu(s)}\).
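To make the two ingredients of the abstract concrete, here is a minimal Python sketch, not the paper's construction. It checks the standard detailed-balance form of reversibility, \(\mu(s)P(s,s') = \mu(s')P(s',s)\), for a given candidate \(\mu\), and runs ordinary semi-gradient TD(0) with a linear value approximation on a small chain. The function names, the three-state chains, the reward vector, and the hyperparameters are all assumptions chosen for illustration.

```python
import random


def satisfies_detailed_balance(P, mu, tol=1e-12):
    """Return True if mu[s] * P[s][t] == mu[t] * P[t][s] for every pair of states."""
    n = len(P)
    return all(
        abs(mu[s] * P[s][t] - mu[t] * P[t][s]) <= tol
        for s in range(n)
        for t in range(n)
    )


def td0_linear(P, rewards, features, gamma=0.9, lr=0.05, steps=20000, seed=0):
    """Semi-gradient TD(0) for V(s) ~ w . features[s] along one sampled trajectory."""
    rng = random.Random(seed)
    states = range(len(P))
    w = [0.0] * len(features[0])
    s = 0
    for _ in range(steps):
        # Sample the next state from row s of the transition matrix.
        s_next = rng.choices(states, weights=P[s])[0]
        v = sum(wi * fi for wi, fi in zip(w, features[s]))
        v_next = sum(wi * fi for wi, fi in zip(w, features[s_next]))
        # TD error: one-step bootstrapped target minus current estimate.
        delta = rewards[s] + gamma * v_next - v
        # For a linear model, the gradient of V(s) with respect to w is features[s].
        w = [wi + lr * delta * fi for wi, fi in zip(w, features[s])]
        s = s_next
    return w


# Hypothetical example chains (not from the paper).
# Lazy random walk on a 3-state path: reversible with mu = (1, 2, 1).
P_walk = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
# Deterministic 3-cycle: fails detailed balance for the uniform candidate mu.
P_cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

walk_ok = satisfies_detailed_balance(P_walk, [1.0, 2.0, 1.0])
cycle_ok = satisfies_detailed_balance(P_cycle, [1.0, 1.0, 1.0])

# One-hot (tabular) features and a constant reward of 1 in every state.
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = td0_linear(P_walk, rewards=[1.0, 1.0, 1.0], features=one_hot)
```

With a constant reward of 1 and discount 0.9, the true value of every state is \(1/(1-0.9) = 10\), so the learned weights can be compared against that target. The checker only tests the candidate \(\mu\) it is given. Showing that a chain is not reversible requires ruling out every positive \(\mu\).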


Tags

  • Policy Gradient
