Near-optimal Partially Observable Reinforcement Learning With Partial Online State Information

Abstract

Partially observable Markov decision processes (POMDPs) are a general framework for sequential decision-making under latent state uncertainty, yet learning in POMDPs is intractable in the worst case. Motivated by sensing and probing constraints in practice, we study how much online state information (OSI) is sufficient to enable efficient learning guarantees. We formalize a model in which the learner can query only partial OSI (POSI) during interaction. We first prove an information-theoretic hardness result showing that, for general POMDPs, achieving an \(\epsilon\)-optimal policy can require sample complexity that is exponential unless full OSI is available. We then identify two structured subclasses that remain learnable under POSI and propose corresponding algorithms with provably efficient performance guarantees. In particular, we establish regret upper bounds with \(\tilde\{O\}(\sqrt\{K\})\) dependence on the number of episodes \(K\), together with complementary lower bounds, the

Near-optimal Partially Observable Reinforcement Learning With Partial Online State Information

Abstract

Authors

Tags

Stats

Related papers