Confidence Interval For Off-policy Evaluation From Dependent Samples Via Bandit Algorithm: Approach From Standardized Martingales
2020 · Masahiro Kato
Abstract
This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via a bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing OPE methods ignore this issue and assume that the samples are i.i.d. In this study, we address this problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with a two-step estimation. This technique produces an asymptotically normal estimator without restricting the class of behavior policies. In an experiment, the proposed estimator performs better than existing methods, which assume that the behavior policy con
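The idea of standardizing a martingale difference sequence can be illustrated with a toy example. The sketch below is not the paper's estimator: it assumes a context-free two-armed Bernoulli bandit, an epsilon-greedy behavior policy with known logged probabilities, and a plug-in conditional standard deviation computed from past rounds only (a simplification standing in for the evaluation-data or sample-splitting step). All names and constants (`simulate_and_estimate`, `mu`, `pi_e`, the 0.1/0.9 split) are hypothetical. The inverse-propensity score `phi` has conditional mean equal to the policy value, so `(phi - theta) / s_t` is a martingale difference with roughly unit conditional variance, and a martingale central limit theorem motivates the normal interval.

```python
import math
import random


def simulate_and_estimate(T=5000, seed=0):
    """Toy two-armed bandit; every constant here is an illustrative assumption."""
    rng = random.Random(seed)
    mu = [0.3, 0.6]      # true Bernoulli reward means (hypothetical)
    pi_e = [0.5, 0.5]    # evaluation policy whose value we want
    theta_true = sum(p * m for p, m in zip(pi_e, mu))

    counts = [1.0, 1.0]  # pseudo-counts so the running means start at 0.5
    sums = [0.5, 0.5]
    num = 0.0            # running sum of phi_t / s_t
    den = 0.0            # running sum of 1 / s_t

    for _ in range(T):
        means = [sums[a] / counts[a] for a in range(2)]

        # Behavior policy depends on past rewards (epsilon-greedy, eps = 0.2),
        # so the logged samples are not i.i.d.
        best = 0 if means[0] >= means[1] else 1
        pi_t = [0.1, 0.1]
        pi_t[best] = 0.9

        # Plug-in conditional std of the IPW score, built from past data only,
        # so s_t is fixed before a_t and r_t are drawn. For Bernoulli rewards
        # E[r^2 | a] = E[r | a], which is why `means` appears in `second`.
        theta_past = sum(pi_e[a] * means[a] for a in range(2))
        second = sum(pi_e[a] ** 2 / pi_t[a] * means[a] for a in range(2))
        s_t = math.sqrt(max(second - theta_past ** 2, 1e-3))

        a = 0 if rng.random() < pi_t[0] else 1
        r = 1.0 if rng.random() < mu[a] else 0.0
        phi = pi_e[a] / pi_t[a] * r  # IPW score with E[phi | past] = theta

        num += phi / s_t
        den += 1.0 / s_t
        counts[a] += 1.0
        sums[a] += r

    theta_hat = num / den
    # sum_t (phi_t - theta) / s_t is approximately N(0, T), which gives this width.
    half = 1.96 * math.sqrt(T) / den
    return theta_hat, (theta_hat - half, theta_hat + half), theta_true
```

Calling `simulate_and_estimate()` returns the point estimate, an approximate 95% interval, and the true value of the toy evaluation policy. Weighting each score by `1 / s_t` is what keeps the normalized sum close to unit variance even though the behavior policy, and hence the variance of `phi`, changes from round to round.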
Related papers
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- Anytime-valid Off-policy Inference For Contextual Bandits (2022)
- An Instrumental Variable Approach To Confounded Off-policy Evaluation (2022)
- Non-asymptotic Confidence Intervals Of Off-policy Evaluation: Primal And Dual Bounds (2021)
- Off-policy Evaluation And Learning From Logged Bandit Feedback: Error Reduction Via Surrogate Policy (2018)
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)
- Local Metric Learning For Off-policy Evaluation In Contextual Bandits With Continuous Actions (2022)
- Doubly Robust Estimator For Off-policy Evaluation With Large Action Spaces (2023)