Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning
2019 · Nathan Kallus, Masatoshi Uehara
Abstract
Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if the Q-functions are well-specified, but if the Q-functions are misspecified, these estimators can be worse than both IS and SNIS. They also do not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. Along the way, we categorize various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.
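To make the three baseline estimators named in the abstract concrete, here is a minimal sketch for the contextual-bandit case with a single context and two actions. It is not the empirical-likelihood estimator the paper proposes. The function names, the toy policies, the reward means, and the deliberately misspecified Q-function `q_hat` are all illustrative assumptions, not taken from the paper.

```python
import random


def is_estimate(rewards, weights):
    """Importance sampling: sample mean of w_i * r_i."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snis_estimate(rewards, weights):
    """Self-normalized IS: a weighted average of the rewards,
    so the result always lies between the smallest and largest reward."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def dr_estimate(rewards, weights, q_logged, v_target):
    """Doubly robust: model-based value plus an IS-weighted residual.

    q_logged[i] is q_hat at the logged action;
    v_target[i] is the expectation of q_hat under the evaluation policy.
    """
    terms = (v + w * (r - q)
             for v, w, r, q in zip(v_target, weights, rewards, q_logged))
    return sum(terms) / len(rewards)


def simulate(n=5000, seed=0):
    """Toy logged data: all numbers below are assumptions for illustration."""
    rng = random.Random(seed)
    pi_b = [0.8, 0.2]      # behavior policy that logged the data
    pi_e = [0.2, 0.8]      # evaluation policy we want to assess
    mean_r = [0.3, 0.7]    # Bernoulli reward means; true value is 0.62
    q_hat = [0.5, 0.5]     # deliberately misspecified Q-function
    v_hat = sum(p * q for p, q in zip(pi_e, q_hat))

    rewards, weights, q_logged, v_target = [], [], [], []
    for _ in range(n):
        a = 0 if rng.random() < pi_b[0] else 1
        rewards.append(1.0 if rng.random() < mean_r[a] else 0.0)
        weights.append(pi_e[a] / pi_b[a])   # density ratio pi_e / pi_b
        q_logged.append(q_hat[a])
        v_target.append(v_hat)

    return {
        "IS": is_estimate(rewards, weights),
        "SNIS": snis_estimate(rewards, weights),
        "DR": dr_estimate(rewards, weights, q_logged, v_target),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Because SNIS divides by the sum of the weights rather than by the sample size, its estimate cannot leave the range of observed rewards. This is the boundedness property the abstract refers to. Neither IS nor DR has this guarantee.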
Related papers
- More Efficient Off-policy Evaluation Through Regularized Targeted Learning (2019)
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)
- Empirical Study Of Off-policy Policy Evaluation For Reinforcement Learning (2019)
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020)
- Conformal Off-policy Evaluation In Markov Decision Processes (2023)
- Doubly Robust Estimator For Off-policy Evaluation With Large Action Spaces (2023)
- Doubly Robust Distributionally Robust Off-policy Evaluation And Learning (2022)