Proximal Reinforcement Learning: Efficient Off-policy Evaluation In Partially Observed Markov Decision Processes
2021 · Andrew Bennett, Nathan Kallus
Abstract
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP given trajectories with only partial state observations generated by a different and unknown policy that may depend on the unobserved state. We tackle two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how to best estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functio
Related papers
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)
- Near-optimal Partially Observable Reinforcement Learning With Partial Online State Information (2023)
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020)
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)
- Conformal Off-policy Evaluation In Markov Decision Processes (2023)
- Robust Reinforcement Learning In POMDPs With Incomplete And Noisy Observations (2019)
- Reinforcement Learning From Partial Observation: Linear Function Approximation With Provable Sample Efficiency (2022)