An Analysis Of Action-value Temporal-difference Methods That Learn State Values
2025 · Brett Daley, Prabhat Nagarajan, Martha White, et al.
Abstract
The hallmark feature of temporal-difference (TD) learning is bootstrapping: using value predictions to generate new value predictions. The vast majority of TD methods for control learn a policy by bootstrapping from a single action-value function (e.g., Q-learning and Sarsa). Significantly less attention has been given to methods that bootstrap from two asymmetric value functions: i.e., methods that learn state values as an intermediate step in learning action values. Existing algorithms in this vein can be categorized as either QV-learning or AV-learning. Though these algorithms have been investigated to some degree in prior work, it remains unclear if and when it is advantageous to learn two value functions instead of just one -- and whether such approaches are theoretically sound in general. In this paper, we analyze these algorithmic families in terms of convergence and sample efficiency. We find that while both families are more efficient than Expected Sarsa in the prediction sett
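To make the idea of bootstrapping from two asymmetric value functions concrete, here is a minimal tabular sketch of a QV-learning-style update as it is commonly described in prior work. It is an illustration under assumptions, not necessarily the exact variant analyzed in the paper: the function name `qv_learning_update`, the step sizes `alpha` and `beta`, the discount `gamma`, and the array layout are all hypothetical choices made for this example.

```python
import numpy as np

def qv_learning_update(Q, V, s, a, r, s_next, done,
                       alpha=0.1, beta=0.1, gamma=0.99):
    """One tabular QV-learning-style step; updates Q and V in place.

    Assumed layout (hypothetical): Q has shape (n_states, n_actions),
    V has shape (n_states,).
    """
    # Both value functions bootstrap from the state value of the next state.
    target = r + (0.0 if done else gamma * V[s_next])
    # State-value update: an ordinary TD(0) step.
    V[s] += beta * (target - V[s])
    # Action-value update: same target, so Q bootstraps from V, not from Q.
    Q[s, a] += alpha * (target - Q[s, a])
    return target

# Toy usage with 2 states and 2 actions, all values initialized to zero.
Q = np.zeros((2, 2))
V = np.zeros(2)
target = qv_learning_update(Q, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

The point of the sketch is the asymmetry: the action-value table never appears in its own target, whereas Q-learning and Sarsa would bootstrap from `Q[s_next]` directly.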
Related papers
- On The Statistical Benefits Of Temporal Difference Learning (2023)
- Preferential Temporal Difference Learning (2021)
- Gradient Iterated Temporal-difference Learning (2026)
- Differential Temporal Difference Learning (2018)
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019)
- A Generalized Bootstrap Target For Value-learning, Efficiently Combining Value And Feature Predictions (2022)
- Prediction And Control In Continual Reinforcement Learning (2023)
- Fixed-horizon Temporal Difference Methods For Stable Reinforcement Learning (2019)