Off-policy Evaluation In Markov Decision Processes Under Weak Distributional Overlap
2024 · Mohammad Mehrabi, Stefan Wager
Abstract
Doubly robust methods hold considerable promise for off-policy evaluation in Markov decision processes (MDPs) under sequential ignorability: They have been shown to converge as \(1/\sqrt{T}\) with the horizon \(T\), to be statistically efficient in large samples, and to allow for modular implementation where preliminary estimation tasks can be executed using standard reinforcement learning techniques. Existing results, however, make heavy use of a strong distributional overlap assumption whereby the stationary distributions of the target policy and the data-collection policy are within a bounded factor of each other -- and this assumption is typically only credible when the state space of the MDP is bounded. In this paper, we re-visit the task of off-policy evaluation in MDPs under a weaker notion of distributional overlap, and introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting. When the distribution ratio of the target and dat
Related papers
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)
- Robust Anytime Learning Of Markov Decision Processes (2022)
- Doubly Robust Distributionally Robust Off-policy Evaluation And Learning (2022)
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)
- Twice Regularized Markov Decision Processes: The Equivalence Between Robustness And Regularization (2023)
- Twice Regularized MDPs And The Equivalence Between Robustness And Regularization (2021)
- Linear Mixture Distributionally Robust Markov Decision Processes (2025)
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)