Doubly Robust Interval Estimation For Optimal Policy Evaluation In Online Learning
2021 · Ye Shen, Hengrui Cai, Rui Song
Abstract
Evaluating the performance of an ongoing policy plays a vital role in many areas, such as medicine and economics, because it provides crucial guidance on early stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning, which infers the mean outcome of the optimal policy (i.e., the value) in real time, has therefore attracted increasing attention. Yet this problem is particularly challenging because of the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration-exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration, which quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust i
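The two ingredients named in the abstract, a probability of exploration and a doubly robust estimate of the optimal policy's value, can be illustrated with a toy simulation. The sketch below is a minimal illustration under stated assumptions, not the authors' estimator: it assumes an epsilon-greedy bandit with two arms that explores uniformly over all arms (so the probability of exploring a non-optimal arm is epsilon*(K-1)/K), hypothetical linear outcome models fitted by online ridge regression, simulated Gaussian data, an AIPW-style pseudo-outcome, and a plain Wald interval built from the sample standard deviation of the pseudo-outcomes. The function names `greedy_probability` and `run_dr_interval` are hypothetical.

```python
import numpy as np


def greedy_probability(epsilon: float, n_actions: int) -> float:
    """P(pick the currently estimated optimal arm) under epsilon-greedy that
    explores uniformly over all arms; 1 minus this is the exploration probability."""
    return 1.0 - epsilon + epsilon / n_actions


def run_dr_interval(n_rounds=2000, epsilon=0.1, dim=3, seed=0, z=1.96):
    """Toy online doubly robust value estimate with a Wald-type interval.
    Assumptions (hypothetical, for illustration only): 2 arms, linear outcomes,
    ridge penalty 1, and a normal approximation for the averaged pseudo-outcomes."""
    rng = np.random.default_rng(seed)
    n_actions = 2
    # Hypothetical true linear outcome model per arm (simulation only).
    theta_true = rng.normal(size=(n_actions, dim))
    # Per-arm ridge regression statistics for the online conditional mean.
    gram = np.stack([np.eye(dim) for _ in range(n_actions)])
    xty = np.zeros((n_actions, dim))
    pseudo = []
    for _ in range(n_rounds):
        x = rng.normal(size=dim)
        # Estimates use only data from earlier rounds (no look-ahead).
        theta_hat = np.stack(
            [np.linalg.solve(gram[a], xty[a]) for a in range(n_actions)]
        )
        mu_hat = theta_hat @ x             # estimated mean outcome per arm
        greedy = int(np.argmax(mu_hat))    # estimated optimal arm
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))  # explore uniformly
        else:
            action = greedy
        y = theta_true[action] @ x + rng.normal()
        # AIPW-style pseudo-outcome: model prediction for the estimated optimal
        # arm plus an inverse-probability-weighted residual when that arm was played.
        match = float(action == greedy)
        p = greedy_probability(epsilon, n_actions)
        pseudo.append(mu_hat[greedy] + match / p * (y - mu_hat[action]))
        # Update the outcome model of the arm actually played.
        gram[action] += np.outer(x, x)
        xty[action] += y * x
    pseudo = np.asarray(pseudo)
    est = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n_rounds)
    return est, (est - z * se, est + z * se)


est, (lo, hi) = run_dr_interval()
print(f"estimated value {est:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

The residual term is weighted by the probability of following the estimated optimal arm, which is where an explicit exploration probability enters; because the online data are dependent, treating the pseudo-outcomes' sample variance as valid for a normal interval is itself an assumption of this sketch rather than a result from the paper.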
Related papers
- Online Estimation And Inference For Robust Policy Evaluation In Reinforcement Learning (2023)
- Doubly Optimal Policy Evaluation For Reinforcement Learning (2024)
- Doubly Robust Off-policy Value And Gradient Estimation For Deterministic Policies (2020)
- Adaptive Doubly Robust Estimator From Non-stationary Logging Policy Under A Convergence Of Average Probability (2021)
- Robust Fitted-q-evaluation And Iteration Under Sequentially Exogenous Unobserved Confounders (2023)
- Towards Robust Off-policy Learning For Runtime Uncertainty (2022)
- Near-optimal Provable Uniform Convergence In Offline Policy Evaluation For Reinforcement Learning (2020)
- Efficient Evaluation Of Natural Stochastic Policies In Offline Reinforcement Learning (2020)