RL In Latent MDPs Is Tractable: Online Guarantees Via Off-policy Evaluation
2024 · Jeongyeol Kwon, Shie Mannor, Constantine Caramanis, et al.
Abstract
In many real-world decision problems there is partially observed, hidden, or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound (Kwon et al., 2021). We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds on a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new cove
Related papers
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)
- Proximal Reinforcement Learning: Efficient Off-policy Evaluation In Partially Observed Markov Decision Processes (2021)
- Matrix Estimation For Offline Reinforcement Learning With Low-rank Structure (2023)
- Variational Latent Branching Model For Off-policy Evaluation (2023)
- Off-policy Evaluation In Markov Decision Processes Under Weak Distributional Overlap (2024)
- Near-optimal Partially Observable Reinforcement Learning With Partial Online State Information (2023)
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)