Policy Learning "without" Overlap: Pessimism And Generalized Empirical Bernstein's Inequality
2022 · Ying Jin, Zhimei Ren, Zhuoran Yang, et al.
Abstract
This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap c
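To make the contrast between optimizing point estimates and optimizing lower confidence bounds concrete, here is a minimal sketch. It is not the paper's estimator: it assumes a small finite policy class, a plain inverse-propensity-weighted (IPW) value estimate, an uncertainty width built from the inverse squared propensities, and a tuning constant `beta`. All names (`ipw_value_and_width`, `pessimistic_choice`, the toy data) are hypothetical.

```python
import math


def ipw_value_and_width(policy, data):
    """Return (IPW point estimate, uncertainty width) for a deterministic policy.

    data: list of (x, a, y, e) tuples, where e is the known behavior
    propensity of the logged action a given context x.
    """
    n = len(data)
    total = 0.0
    sq_weights = 0.0
    for x, a, y, e in data:
        if policy(x) == a:
            total += y / e               # importance-weighted reward
            sq_weights += 1.0 / (e * e)  # grows quickly when e is small
    return total / n, math.sqrt(sq_weights) / n


def pessimistic_choice(policies, data, beta=1.0):
    """Pick the policy with the largest lower confidence bound.

    beta = 0 recovers the greedy choice based on point estimates only.
    The penalty form (beta * width) is an assumption of this sketch.
    """
    best_name, best_lcb = None, -math.inf
    for name, policy in policies.items():
        value, width = ipw_value_and_width(policy, data)
        lcb = value - beta * width
        if lcb > best_lcb:
            best_name, best_lcb = name, lcb
    return best_name, best_lcb


# Toy logged data: action 0 is well explored (propensity 0.9),
# action 1 is logged once with a small propensity (0.1).
data = [(None, 0, 0.5, 0.9)] * 9 + [(None, 1, 1.0, 0.1)]

policies = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
}

greedy_name, _ = pessimistic_choice(policies, data, beta=0.0)
pessimistic_name, _ = pessimistic_choice(policies, data, beta=1.0)
print(greedy_name, pessimistic_name)
```

In this toy data the rarely taken action has a high point estimate that rests on a single heavily weighted observation, so the greedy rule and the pessimistic rule select different policies. The penalty needs no lower bound on the propensities; poorly covered policies simply receive wider intervals.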
Related papers
- Logarithmic Smoothing For Pessimistic Off-policy Evaluation, Selection And Learning (2024)
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)
- Is Pessimism Provably Efficient For Offline RL? (2020)
- Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage (2023)
- Optimistic Policy Learning Under Pessimistic Adversaries With Regret And Violation Guarantees (2026)
- POPO: Pessimistic Offline Policy Optimization (2020)
- State-aware Proximal Pessimistic Algorithms For Offline Reinforcement Learning (2022)
- Federated Offline Policy Learning (2023)