Asymptotically Efficient Off-policy Evaluation For Tabular Reinforcement Learning
2020 · Ming Yin, Yu-Xiang Wang
Abstract
We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy \(\pi\) using offline data collected by running a logging policy \(\mu\). Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with the time horizon \(H\), which has motivated a surge of recent interest in alternatives that break the "Curse of Horizon" (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can achieve a mean square error (MSE) of order \(O(H^3/n)\) under an episodic Markov Decision Process model with finite states and potentially infinite actions. This MSE bound, however, is still a factor of \(H\) away from a Cramér-Rao lower bound of order \(\Omega(H^2/n)\). In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramér-Rao lower
Related papers
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)
- Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data (2023)
- Black-box Off-policy Estimation For Infinite-horizon Reinforcement Learning (2020)
- Nearly Horizon-free Offline Reinforcement Learning (2021)
- Efficiently Breaking The Curse Of Horizon In Off-policy Evaluation With Double Reinforcement Learning (2019)
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025)
- Efficient Evaluation Of Natural Stochastic Policies In Offline Reinforcement Learning (2020)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)