Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
2019 · Nathan Kallus, Masatoshi Uehara
Abstract
Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density rat
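The diminishing overlap mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's DRL estimator. It assumes a hypothetical two-action setting in which both policies ignore the state, so the per-step importance ratios are i.i.d.; the function names and the policy probabilities are invented for illustration. Under that assumption, the variance of the cumulative importance ratio over a horizon of T steps has a closed form and grows geometrically in T unless the two policies coincide.

```python
# Illustrative sketch only: a toy, state-independent setting chosen for this
# example. It is NOT the DRL estimator described in the abstract.

def per_step_ratio_second_moment(pi_e, pi_b):
    """E_b[(pi_e(a) / pi_b(a))**2], with actions drawn from the behavior policy.

    pi_e and pi_b are lists of action probabilities (target and behavior).
    """
    return sum(e * e / b for e, b in zip(pi_e, pi_b))


def cumulative_ratio_variance(pi_e, pi_b, horizon):
    """Variance of the product of `horizon` i.i.d. per-step importance ratios.

    Each ratio has mean 1 under the behavior policy, so the product has
    mean 1 and second moment m2**horizon; its variance is m2**horizon - 1.
    """
    m2 = per_step_ratio_second_moment(pi_e, pi_b)
    return m2 ** horizon - 1.0


if __name__ == "__main__":
    pi_b = [0.5, 0.5]  # hypothetical behavior policy
    pi_e = [0.7, 0.3]  # hypothetical target policy
    for T in (1, 10, 100):
        print(T, cumulative_ratio_variance(pi_e, pi_b, T))
```

When the target and behavior policies are identical, the per-step second moment is 1 and the variance stays at zero for every horizon, which matches the near-on-policy regime the abstract describes.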
Related papers
- Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes (2019)
- Off-Policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders (2020)
- Off-Policy Evaluation in Doubly Inhomogeneous Environments (2023)
- More Efficient Off-Policy Evaluation through Regularized Targeted Learning (2019)
- Black-Box Off-Policy Estimation for Infinite-Horizon Reinforcement Learning (2020)
- Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling (2019)
- Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning (2019)
- Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning (2020)