Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes
2022 · Miao Lu, Yifei Min, Zhaoran Wang, et al.
Abstract
We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves
Related papers
- Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage (2023)
- Proximal Reinforcement Learning: Efficient Off-policy Evaluation In Partially Observed Markov Decision Processes (2021)
- Is Pessimism Provably Efficient For Offline RL? (2020)
- Near-optimal Partially Observable Reinforcement Learning With Partial Online State Information (2023)
- POPO: Pessimistic Offline Policy Optimization (2020)
- Pessimistic Causal Reinforcement Learning With Mediators For Confounded Offline Data (2024)
- Bridging Offline Reinforcement Learning And Imitation Learning: A Tale Of Pessimism (2021)
- Offline Reinforcement Learning With Instrumental Variables In Confounded Markov Decision Processes (2022)