Model-based Learning Of Near-optimal Finite-window Policies In POMDPs
2026 · Philip Jordan, Maryam Kamgarpour
Abstract
We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight
Related papers
- Sequential Monte Carlo For Policy Optimization In Continuous POMDPs (2025)
- Convergence Of Finite Memory Q-learning For POMDPs And Near Optimality Of Learned Policies Under Filter Stability (2021)
- Finite-state Controllers For (hidden-model) POMDPs Using Deep Reinforcement Learning (2026)
- Sample-efficient Learning Of POMDPs With Multiple Observations In Hindsight (2023)
- Agent-state Based Policies In POMDPs: Beyond Belief-state MDPs (2024)
- Hidden Markov Model Estimation-based Q-learning For Partially Observable Markov Decision Process (2018)
- Statistical Tractability Of Off-policy Evaluation Of History-dependent Policies In POMDPs (2025)
- Efficient Learning Of POMDPs With Known Observation Model In Average-reward Setting (2024)