Learning Successor States And Goal-dependent Values: A Mathematical Viewpoint
2021 · Léonard Blier, Corentin Tallec, Yann Ollivier
Abstract
In reinforcement learning, temporal difference-based algorithms can be sample-inefficient: for instance, with sparse rewards, no learning occurs until a reward is observed. This can be remedied by learning richer objects, such as a model of the environment, or successor states. Successor states model the expected future state occupancy from any given state for a given policy and are related to goal-dependent value functions, which learn how to reach arbitrary states. We formally derive the temporal difference algorithm for successor state and goal-dependent value function learning, either for discrete or for continuous environments with function approximation. In particular, we provide finite-variance estimators even in continuous environments, where the reward for exactly reaching a goal state becomes infinitely sparse. Successor states satisfy more than just the Bellman equation: a backward Bellman operator and a Bellman-Newton (BN) operator encode path compositionality in the environme
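The idea of learning successor states by temporal difference can be illustrated in the simplest tabular case. The following is a minimal sketch, not the authors' implementation: it assumes a small deterministic 3-state cycle, the convention that the successor state matrix counts the current state (so M = sum over t >= 0 of gamma^t P^t), and hypothetical names (`td_successor_states`, `transitions`, `lr`, `sweeps`) chosen for illustration only.

```python
def td_successor_states(transitions, n_states, gamma=0.5, lr=0.5, sweeps=200):
    """Tabular TD learning of the successor state matrix M.

    M[s][g] estimates the expected discounted number of visits to state g
    when starting from state s (the visit at time 0 is counted).
    `transitions` is a list of observed (s, s_next) pairs under a fixed policy.
    """
    M = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(sweeps):
        for s, s_next in transitions:
            for g in range(n_states):
                # Bellman target: indicator of being at g now, plus the
                # discounted occupancy of g predicted from the next state.
                target = (1.0 if g == s else 0.0) + gamma * M[s_next][g]
                M[s][g] += lr * (target - M[s][g])
    return M


# Assumed toy environment: a deterministic cycle 0 -> 1 -> 2 -> 0.
transitions = [(0, 1), (1, 2), (2, 0)]
M = td_successor_states(transitions, n_states=3)

# Once M is learned, the value function for any reward follows by a
# matrix-vector product, here for a sparse reward located at state 2.
reward = [0.0, 0.0, 1.0]
V = [sum(M[s][g] * reward[g] for g in range(3)) for s in range(3)]
```

Note that the update for M uses every observed transition, whether or not a reward was seen, which is the sense in which successor states sidestep the sparse-reward problem in this toy setting.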
Related papers
- Differential Temporal Difference Learning (2018)
- Preferential Temporal Difference Learning (2021)
- Loss Dynamics Of Temporal Difference Reinforcement Learning (2023)
- Approximate Temporal Difference Learning Is A Gradient Descent For Reversible Policies (2018)
- Temporal Difference Learning With Continuous Time And State In The Stochastic Setting (2022)
- On The Statistical Benefits Of Temporal Difference Learning (2023)
- Successor Features Combine Elements Of Model-free And Model-based Reinforcement Learning (2019)
- An Analysis Of Action-value Temporal-difference Methods That Learn State Values (2025)