Local Metric Learning For Off-policy Evaluation In Contextual Bandits With Continuous Actions
2022 · Haanvid Lee, Jongmin Lee, Yunseon Choi, et al.
Abstract
We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for a deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on an analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work goes a step further by handling vector action spaces and optimizing the metric. We show that our estimator is consistent,
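To make the kernel relaxation described in the abstract concrete, the following is a minimal sketch of a kernel-relaxed importance sampling estimate with a fixed metric. It is not the paper's learned local metric or its analytic solution. The Gaussian kernel, the metric matrix `metric`, the scalar `bandwidth`, and the function name `kernel_is_estimate` are all assumptions made for illustration.

```python
import numpy as np


def kernel_is_estimate(actions, target_actions, rewards, behavior_density,
                       bandwidth, metric=None):
    """Kernel-relaxed IS estimate of a deterministic policy's value.

    Hypothetical helper for illustration only. Assumes a Gaussian kernel
    with a fixed positive-definite metric matrix (identity by default).

    actions:          (n, d) logged actions
    target_actions:   (n, d) actions the target policy would take
    rewards:          (n,)   logged rewards
    behavior_density: (n,)   behavior-policy density at each logged action
    """
    actions = np.asarray(actions, dtype=float)
    target_actions = np.asarray(target_actions, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    behavior_density = np.asarray(behavior_density, dtype=float)

    n, d = actions.shape
    if metric is None:
        metric = np.eye(d)

    # Squared Mahalanobis distance between logged and target actions.
    diff = actions - target_actions
    sq_dist = np.einsum("ni,ij,nj->n", diff, metric, diff)

    # Normalizing constant so the kernel integrates to one over actions.
    norm = (2.0 * np.pi) ** (d / 2) * bandwidth ** d / np.sqrt(np.linalg.det(metric))
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / norm

    # The kernel replaces the point mass of the deterministic policy
    # in the usual importance weight.
    return float(np.mean(kernel / behavior_density * rewards))


# Usage: behavior policy uniform on [-1, 1]^2 (density 0.25), target action
# at the origin, constant reward 1.
rng = np.random.default_rng(0)
logged_actions = rng.uniform(-1.0, 1.0, size=(200_000, 2))
estimate = kernel_is_estimate(
    logged_actions,
    np.zeros_like(logged_actions),
    np.ones(len(logged_actions)),
    np.full(len(logged_actions), 0.25),
    bandwidth=0.1,
)
```

In this sketch a smaller bandwidth reduces the bias from relaxing the policy but increases variance, because fewer logged actions receive non-negligible weight. The metric reshapes the kernel so that some action directions are smoothed more than others, which is the quantity the abstract proposes to learn.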
Related papers
- Kernel Metric Learning For In-sample Off-policy Evaluation Of Deterministic RL Policies (2024)
- Anytime-valid Off-policy Inference For Contextual Bandits (2022)
- Doubly Robust Estimator For Off-policy Evaluation With Large Action Spaces (2023)
- Bayesian Off-policy Evaluation And Learning For Large Action Spaces (2024)
- Policy Evaluation And Optimization With Continuous Treatments (2018)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- POTEC: Off-policy Learning For Large Action Spaces Via Two-stage Policy Decomposition (2024)
- Off-policy Evaluation And Learning From Logged Bandit Feedback: Error Reduction Via Surrogate Policy (2018)