Kernel Metric Learning For In-sample Off-policy Evaluation Of Deterministic RL Policies
2024 · Haanvid Lee, Tri Wahyu Guntara, Jongmin Lee, et al.
Abstract
We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. In order to address this issue, some recent works on OPE proposed in-sample learning with importance resampling. Yet, these approaches are not applicable to deterministic target policies for continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric. In empirical studies
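The kernel relaxation described in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the Gaussian kernel, the diagonal metric `metric_diag`, the scalar `bandwidth`, and all function names below are assumptions made for illustration, since the abstract does not specify them. The idea shown is that a deterministic target action, which has no density in a continuous action space, is replaced by a kernel density centred on it, so that importance weights and resampling probabilities become well defined.

```python
import math


def gaussian_kernel_density(action, target_action, metric_diag, bandwidth):
    """Gaussian kernel density centred on the deterministic target action.

    metric_diag is the diagonal of a hypothetical positive-definite metric
    matrix A; distances are measured as (a - t)^T A (a - t).
    """
    d = len(action)
    sq_dist = sum(
        m * (a - t) ** 2 for m, a, t in zip(metric_diag, action, target_action)
    )
    det = math.prod(metric_diag)  # determinant of a diagonal matrix
    norm = (2.0 * math.pi) ** (d / 2.0) * bandwidth ** d / math.sqrt(det)
    return math.exp(-0.5 * sq_dist / bandwidth ** 2) / norm


def relaxed_importance_weight(action, target_action, behavior_density,
                              metric_diag, bandwidth):
    """Ratio of the relaxed target density to the behavior policy density."""
    kernel = gaussian_kernel_density(action, target_action, metric_diag, bandwidth)
    return kernel / behavior_density


def resampling_probabilities(weights):
    """Normalize importance weights into probabilities for resampling."""
    total = sum(weights)
    return [w / total for w in weights]


# Toy batch (made-up numbers): logged 2-D actions, the target policy's
# action at the same states, and the behavior density of each logged action.
logged_actions = [[0.1, 0.0], [0.9, 0.5], [0.0, 0.2]]
target_actions = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
behavior_densities = [0.5, 0.5, 0.5]

metric_diag = [4.0, 0.25]  # hypothetical metric: dimension 0 matters more
bandwidth = 0.5            # hypothetical kernel bandwidth

weights = [
    relaxed_importance_weight(a, t, mu, metric_diag, bandwidth)
    for a, t, mu in zip(logged_actions, target_actions, behavior_densities)
]
probs = resampling_probabilities(weights)
```

In this sketch, logged transitions whose actions lie close to the target action under the metric receive larger resampling probabilities. The metric controls which action dimensions count as "close"; the paper learns it to minimize the mean squared error of the estimated temporal difference update, which this example does not attempt.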
Related papers
- Local Metric Learning For Off-policy Evaluation In Contextual Bandits With Continuous Actions (2022)
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)
- More Efficient Off-policy Evaluation Through Regularized Targeted Learning (2019)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- Doubly Robust Off-policy Value And Gradient Estimation For Deterministic Policies (2020)
- Empirical Study Of Off-policy Policy Evaluation For Reinforcement Learning (2019)
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)
- Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data (2023)