Importance Sampling Policy Evaluation With An Estimated Behavior Policy
2018 · Josiah P. Hanna, Scott Niekum, Peter Stone
Abstract
We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance s
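The re-weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-step (bandit-style) problem rather than a full Markov decision process, and it assumes the behavior policy is estimated by simple action counts from the same data. The names `ois_estimate`, `estimate_behavior_policy`, `pi_e`, and `pi_b` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ois_estimate(actions, rewards, pi_e, pi_b):
    """Ordinary importance sampling estimate of the evaluation policy's
    expected reward: each reward is re-weighted by pi_e(a) / pi_b(a)."""
    n = len(actions)
    return sum(pi_e[a] / pi_b[a] * r for a, r in zip(actions, rewards)) / n


def estimate_behavior_policy(actions):
    """Count-based estimate of the behavior policy (an assumption of this
    sketch): the empirical frequency of each action in the data."""
    counts = Counter(actions)
    n = len(actions)
    return {a: c / n for a, c in counts.items()}


# Toy data: the true behavior policy is uniform over two actions, but the
# sample happens to contain action 0 twice and action 1 once.
actions = [0, 0, 1]
rewards = [1.0, 1.0, 0.0]

pi_b_true = {0: 0.5, 1: 0.5}   # true behavior policy (assumed known here)
pi_e = {0: 0.9, 1: 0.1}        # evaluation policy

# Behavior policy estimated from the same data used for the IS estimate.
pi_b_hat = estimate_behavior_policy(actions)

is_true = ois_estimate(actions, rewards, pi_e, pi_b_true)
is_estimated = ois_estimate(actions, rewards, pi_e, pi_b_hat)
```

In this toy data, action 0 always yields reward 1 and action 1 yields reward 0, so the evaluation policy's value is 0.9. Weighting by the true behavior policy carries over the over-representation of action 0 in the sample, whereas weighting by the empirical frequencies cancels it, which is one way to read the abstract's remark about correcting for sampling error in the action space.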
Related papers
- Low Variance Off-policy Evaluation With State-based Importance Sampling (2022)
- Data-efficient Policy Evaluation Through Behavior Policy Search (2017)
- Infinite-horizon Off-policy Policy Evaluation With Multiple Behavior Policies (2019)
- Behaviour Policy Estimation In Off-policy Policy Evaluation: Calibration Matters (2018)
- Conditional Importance Sampling For Off-policy Learning (2019)
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)
- Optimal Mixture Weights For Off-policy Evaluation With Multiple Behavior Policies (2020)
- Counterfactual-augmented Importance Sampling For Semi-offline Policy Evaluation (2023)