An Instrumental Variable Approach To Confounded Off-policy Evaluation
2022 · Yang Xu, Jin Zhu, Chengchun Shi, et al.
Abstract
Off-policy evaluation (OPE) estimates the return of a target policy from pre-collected observational data generated by a potentially different behavior policy. When unmeasured variables confound the action-reward or action-next-state relationships, many existing OPE approaches become ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). As in single-stage decision making, we show that an IV allows the target policy's value to be correctly identified in infinite-horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and an analysis of real data from a world-leading short-video platform.
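The single-stage IV identification that the abstract builds on can be illustrated with a small simulation. This is not the paper's estimator for confounded MDPs; it is a textbook Wald (ratio) IV estimator for a single decision, assuming a binary instrument `z` that is independent of an unmeasured confounder `u`, a binary action `a`, and a constant action effect `tau`. The function names and simulation parameters are hypothetical.

```python
import numpy as np

# Hypothetical single-stage illustration; not the paper's MDP estimator.

def simulate(n, tau=1.0, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                  # unmeasured confounder
    z = rng.integers(0, 2, size=n)          # binary instrument, independent of u
    # The action depends on both the instrument and the confounder.
    a = (1.5 * z + 0.8 * u + rng.normal(size=n) > 0.75).astype(float)
    # The reward depends on the action and on the confounder, but not on z directly.
    r = tau * a + 2.0 * u + rng.normal(size=n)
    return z, a, r

def naive_effect(a, r):
    # Difference in mean reward between actions; biased because u drives both a and r.
    return r[a == 1].mean() - r[a == 0].mean()

def wald_iv_effect(z, a, r):
    # Wald estimator: effect of z on reward divided by effect of z on action.
    num = r[z == 1].mean() - r[z == 0].mean()
    den = a[z == 1].mean() - a[z == 0].mean()
    return num / den

z, a, r = simulate(200_000, tau=1.0, seed=0)
print(f"naive: {naive_effect(a, r):.2f}, IV (Wald): {wald_iv_effect(z, a, r):.2f}")
```

The naive contrast absorbs the confounder's influence on reward, whereas the Wald ratio cancels it under the stated assumptions: the instrument shifts the action, is independent of `u`, and affects the reward only through the action.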
Related papers
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020)
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)
- A Spectral Approach To Off-policy Evaluation For POMDPs (2021)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- Variance-aware Off-policy Evaluation With Linear Function Approximation (2021)
- On Instrumental Variable Regression For Deep Offline Policy Evaluation (2021)
- Conformal Off-policy Evaluation In Markov Decision Processes (2023)
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)