O\(^2\)TD: (near)-optimal Off-policy TD Learning
2017 · Bo Liu, Daoming Lyu, Wen Dong, et al.
Abstract
Temporal difference (TD) learning and residual gradient methods are the most widely used temporal-difference-based learning algorithms; however, it has been shown that neither of their objective functions is optimal with respect to approximating the true value function \(V\). This paper proposes two novel algorithms to approximate \(V\) and makes the following contributions: (1) a batch algorithm that helps find an approximately optimal off-policy prediction of \(V\); (2) a near-optimal algorithm with linear per-step computational cost that can learn from a collection of off-policy samples; (3) a new perspective on emphatic temporal difference learning that bridges the gap between off-policy optimality and off-policy stability.
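As a rough illustration of the abstract's claim about objective functions, the sketch below compares three linear solutions on a tiny Markov reward process: the TD fixed point, the residual gradient (Bellman residual) minimizer, and the weighted least-squares projection of the true \(V\). It is not the paper's algorithm. The transition matrix `P`, rewards `r`, features `Phi`, discount `gamma`, and the function name `compare_objectives` are all hypothetical choices made for this example, and the state weighting is the chain's stationary distribution rather than an off-policy behavior distribution.

```python
import numpy as np


def weighted_mse(theta, Phi, V, d):
    """Weighted squared error between the linear estimate Phi @ theta and V."""
    err = Phi @ theta - V
    return float(err @ (d * err))


def compare_objectives():
    # Hypothetical 3-state Markov reward process (all numbers are illustrative).
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.0, 0.8]])
    r = np.array([1.0, 0.0, 2.0])
    gamma = 0.9
    Phi = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0]])  # 2 features for 3 states

    n = P.shape[0]
    L = np.eye(n) - gamma * P

    # True value function: V = (I - gamma P)^{-1} r.
    V = np.linalg.solve(L, r)

    # Stationary distribution by power iteration (assumed state weighting).
    d = np.full(n, 1.0 / n)
    for _ in range(2000):
        d = d @ P
    D = np.diag(d)

    # Best linear fit to V under the weighted squared error.
    theta_mse = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V)

    # TD fixed point: Phi^T D (I - gamma P) Phi theta = Phi^T D r.
    theta_td = np.linalg.solve(Phi.T @ D @ L @ Phi, Phi.T @ D @ r)

    # Residual gradient solution: minimize ||(I - gamma P) Phi theta - r||_D^2.
    M = L @ Phi
    theta_rg = np.linalg.solve(M.T @ D @ M, M.T @ D @ r)

    return {
        "mse_best": weighted_mse(theta_mse, Phi, V, d),
        "mse_td": weighted_mse(theta_td, Phi, V, d),
        "mse_rg": weighted_mse(theta_rg, Phi, V, d),
    }


results = compare_objectives()
for name, value in results.items():
    print(f"{name}: {value:.6f}")
```

By construction, `mse_best` can never exceed the other two errors. When the true \(V\) lies outside the span of the features, the TD and residual gradient solutions typically land at different points with strictly larger error, which is the gap the abstract refers to.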
Related papers
- Approximate Temporal Difference Learning Is A Gradient Descent For Reversible Policies (2018)
- Gradient Iterated Temporal-difference Learning (2026)
- Adaptive Temporal Difference Learning With Linear Function Approximation (2020)
- Backstepping Temporal Difference Learning (2023)
- Preferential Temporal Difference Learning (2021)
- Differential Temporal Difference Learning (2018)
- A Finite Time Analysis Of Temporal Difference Learning With Linear Function Approximation (2018)
- Discerning Temporal Difference Learning (2023)