Behaviour Policy Estimation In Off-policy Policy Evaluation: Calibration Matters
2018 · Aniruddh Raghu, Omer Gottesman, Yao Liu, et al.
Abstract
In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is unknown. Through a series of empirical studies, we demonstrate that accurate OPE depends strongly on the calibration of the estimated behaviour policy model, that is, on how precisely the behaviour policy is estimated from data. We show that powerful parametric models such as neural networks can yield poorly calibrated behaviour policy models on a real-world medical dataset, and illustrate that a simple, non-parametric k-nearest neighbours model produces better-calibrated behaviour policy estimates and can be used to obtain superior importance sampling-based OPE estimates.
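To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes discrete actions, Euclidean distance between states, and the ordinary per-trajectory importance sampling estimator; the function names `knn_behaviour_policy` and `importance_sampling_estimate`, the probability floor `eps`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_behaviour_policy(states, actions, n_actions, k=5):
    """Build a k-nearest-neighbours estimate of the behaviour policy.

    The estimated probability of each action at a query state is the
    empirical action frequency among the k closest logged states
    (Euclidean distance is an assumption of this sketch).
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=int)

    def policy(query):
        dists = np.linalg.norm(states - np.asarray(query, dtype=float), axis=1)
        nearest = np.argsort(dists)[:k]
        counts = np.bincount(actions[nearest], minlength=n_actions)
        return counts / counts.sum()

    return policy


def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0, eps=1e-6):
    """Ordinary importance sampling estimate of the evaluation policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward).
    pi_e, pi_b:   callables mapping a state to a vector of action probabilities.
    eps:          floor on the behaviour probability (an assumption of this
                  sketch, used only to avoid division by zero).
    """
    values = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Cumulative likelihood ratio between evaluation and behaviour policy.
            weight *= pi_e(s)[a] / max(pi_b(s)[a], eps)
            ret += (gamma ** t) * r
        values.append(weight * ret)
    return float(np.mean(values))


# Toy logged data: one-dimensional states, two actions.
logged_states = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
logged_actions = np.array([0, 0, 1, 1, 1, 1])
pi_b_hat = knn_behaviour_policy(logged_states, logged_actions, n_actions=2, k=3)
probs_near_zero = pi_b_hat([0.05])

# Toy evaluation: the evaluation policy always picks action 0.
toy_trajectories = [[(0, 0, 1.0)], [(0, 1, 0.0)]]
uniform_pi_b = lambda s: np.array([0.5, 0.5])
always_zero_pi_e = lambda s: np.array([1.0, 0.0])
value = importance_sampling_estimate(toy_trajectories, always_zero_pi_e, uniform_pi_b)
```

Because the behaviour probability sits in the denominator of every importance weight, a miscalibrated estimate (for example, one that is overconfident and assigns near-zero probability to actions that were actually taken) inflates the weights and distorts the value estimate, which is why calibration matters for this class of estimators.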
Related papers
- Importance Sampling Policy Evaluation With An Estimated Behavior Policy (2018)
- Data-efficient Policy Evaluation Through Behavior Policy Search (2017)
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025)
- Infinite-horizon Off-policy Policy Evaluation With Multiple Behavior Policies (2019)
- Empirical Study Of Off-policy Policy Evaluation For Reinforcement Learning (2019)
- Conformal Off-policy Evaluation In Markov Decision Processes (2023)
- Variance-aware Off-policy Evaluation With Linear Function Approximation (2021)
- Interpretable Off-policy Evaluation In Reinforcement Learning By Highlighting Influential Transitions (2020)