Non-asymptotic Confidence Intervals Of Off-policy Evaluation: Primal And Dual Bounds
2021 · Yihao Feng, Ziyang Tang, Na Zhang, et al.
Abstract
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. Therefore, OPE is a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimate, when applying OPE to make high-stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al. (2019) and a new martingale concentration inequality for KBL applicable to time-dependent data with unknown mixing conditions. Our algorithm makes minimum a
Related papers
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)
- Bootstrapping With Models: Confidence Intervals For Off-policy Evaluation (2016)
- Efficiently Breaking The Curse Of Horizon In Off-policy Evaluation With Double Reinforcement Learning (2019)
- Doubly Robust Distributionally Robust Off-policy Evaluation And Learning (2022)
- Conformal Off-policy Evaluation In Markov Decision Processes (2023)
- Doubly Robust Interval Estimation For Optimal Policy Evaluation In Online Learning (2021)