Future-dependent Value-based Off-policy Evaluation In POMDPs
2022 · Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, et al.
Abstract
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play a role similar to that of classical value functions in fully observable MDPs. We derive a new Bellman equation for future-dependent value functions in the form of conditional moment equations that use history proxies as instrumental variables. We further propose a minimax method for learning future-dependent value functions from the new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Finally, we extend our methods t
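To make the moment-equation idea in the abstract concrete, the following is a minimal sketch under strong assumptions that are not stated in the abstract: the future-dependent value function is linear in a future feature map, the instrumental test functions are linear in a history feature map, and the conditional moment condition takes the form "history features times (importance weight times (reward plus discounted next value) minus current value) averages to zero". All names (`phi_h`, `phi_f`, `phi_f_next`, `mu`, `linear_future_value_weights`, `policy_value`) are hypothetical, and the closed-form linear solve stands in for the minimax procedure rather than reproducing it.

```python
import numpy as np


def linear_future_value_weights(phi_h, phi_f, phi_f_next, mu, reward, gamma):
    """Fit theta so that g(F) = phi_f(F) @ theta satisfies the empirical moment
    condition  mean( phi_h * (mu * (reward + gamma * g(F')) - g(F)) ) = 0.

    Assumed shapes (hypothetical interface):
      phi_h:      (n, d_h) history-proxy features, used as instruments
      phi_f:      (n, d_f) future-proxy features at the current step
      phi_f_next: (n, d_f) future-proxy features at the next step
      mu:         (n,)     importance weights, evaluation / behavior policy
      reward:     (n,)     observed rewards
    """
    n = phi_h.shape[0]
    # Rearranged moment condition: A @ theta = b.
    A = phi_h.T @ (phi_f - gamma * mu[:, None] * phi_f_next) / n
    b = phi_h.T @ (mu * reward) / n
    # Least squares handles the case d_h != d_f or a rank-deficient A.
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta


def policy_value(theta, phi_f_initial):
    """Average the fitted value function over futures drawn at the initial step."""
    return float(np.mean(phi_f_initial @ theta))


# Degenerate sanity check: one state, reward 1 at every step, on-policy data
# (mu = 1), gamma = 0.5, so the discounted value is 1 / (1 - 0.5) = 2.
n = 100
ones = np.ones((n, 1))
theta = linear_future_value_weights(
    ones, ones, ones, mu=np.ones(n), reward=np.ones(n), gamma=0.5
)
value = policy_value(theta, ones)
```

In the fully observable degenerate case above the solve reduces to an ordinary LSTD-style fixed point; in a genuinely partially observable problem the quality of the estimate would depend on how informative the chosen history and future features are about the latent state.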
Related papers
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)
- A Spectral Approach To Off-policy Evaluation For POMDPs (2021)
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)
- A Maximum-entropy Approach To Off-policy Evaluation In Average-reward MDPs (2020)
- Variance-aware Off-policy Evaluation With Linear Function Approximation (2021)
- An Instrumental Variable Approach To Confounded Off-policy Evaluation (2022)
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020)
- Sequential Monte Carlo For Policy Optimization In Continuous POMDPs (2025)