Online Off-policy Prediction
2018 · Sina Ghiassian, Andrew Patterson, Martha White, et al.
Abstract
This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving, represented as a value function. However, the behavior used to select actions and generate the behavior data might be different from the one used to define the predictions, and thus the samples are generated off-policy. The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems but remained an open algorithmic challenge for decades. The issue lies with the temporal difference (TD) learning update at the heart of most prediction algorithms: combining bootstrapping, off-policy sampling and function approximation may cause the value estimate to diverge. A breakthrough came with the development of a new objective function that admitted stochastic gradient descent v
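The divergence the abstract attributes to combining bootstrapping, off-policy sampling, and function approximation can be illustrated with a well-known two-state textbook example. It is not taken from this paper, and the function name and parameter defaults below are hypothetical. A minimal sketch, assuming a single weight `w`, a state A with feature 1, a state B with feature 2, zero reward, and updates applied only to the A -> B transition (which stands in for an off-policy sampling distribution):

```python
def semi_gradient_td_two_state(w0=1.0, gamma=0.99, alpha=0.1, steps=50):
    """Apply the linear semi-gradient TD(0) update repeatedly to one transition.

    Assumed setup (illustrative, not from the paper):
      - value estimate of state A is 1 * w, of state B is 2 * w
      - the reward on A -> B is 0, so the true values are 0
      - only the A -> B transition is ever sampled and updated
    Returns the list of weights visited, starting with w0.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error: r + gamma * v(B) - v(A), with r = 0
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # The gradient of v(A) with respect to w is the feature of A, which is 1
        w += alpha * td_error * 1.0
        history.append(w)
    return history


if __name__ == "__main__":
    diverging = semi_gradient_td_two_state(gamma=0.99)
    shrinking = semi_gradient_td_two_state(gamma=0.4)
    print("gamma=0.99, final |w|:", abs(diverging[-1]))
    print("gamma=0.40, final |w|:", abs(shrinking[-1]))
```

Each update multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so under these assumptions the weight grows without bound whenever `gamma > 0.5` and shrinks toward the true value of 0 otherwise.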