Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting
2024 · Alessio Russo, Alberto Maria Metelli, Marcello Restelli
Abstract
Dealing with Partially Observable Markov Decision Processes (POMDPs) is notoriously challenging. We consider an average-reward, infinite-horizon POMDP setting with an unknown transition model, in which the observation model is assumed to be known. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy. We then propose the OAS-UCRL algorithm, which implicitly balances the exploration-exploitation trade-off by following the \(\textit{optimism in the face of uncertainty}\) principle. The algorithm runs through episodes of increasing length. In each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples, which the OAS estimation procedure uses in the next episode to compute a new estimate of the POMDP parameters. Given the estimated model, an optimization oracle computes the new optimal policy. W
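A belief-based policy acts on a belief, that is, a probability distribution over hidden states that is updated after every action and observation. The following is a minimal sketch of the standard Bayes-filter belief update, not the authors' implementation. It assumes a finite state space, an observation model that depends only on the current state, and a transition estimate supplied by the caller; the names `belief_update`, `transition`, and `observation_model` are hypothetical.

```python
from typing import List, Sequence


def belief_update(
    belief: Sequence[float],
    action: int,
    observation: int,
    transition: Sequence[Sequence[Sequence[float]]],
    observation_model: Sequence[Sequence[float]],
) -> List[float]:
    """One step of the standard Bayes filter over hidden states.

    Assumed layout (illustrative, not taken from the paper):
      transition[a][s][s2]    = P(s2 | s, a), e.g. the current estimate
      observation_model[s][o] = P(o | s), the known observation model
    """
    n = len(belief)
    # Prediction: push the belief through the (estimated) transition model.
    predicted = [
        sum(belief[s] * transition[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correction: reweight by the likelihood of the received observation.
    unnormalized = [
        observation_model[s2][observation] * predicted[s2] for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Toy example: two states, one action, two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[0.8, 0.2], [0.3, 0.7]]
new_belief = belief_update([0.5, 0.5], 0, 0, T, O)
```

In this toy example, observing `0` shifts probability mass toward state `0`, because that state emits observation `0` more often.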
Related papers
- Experimental Results: Reinforcement Learning of POMDPs Using Spectral Methods (2017)
- Robust Reinforcement Learning in POMDPs with Incomplete and Noisy Observations (2019)
- Sample-Efficient Learning of POMDPs with Multiple Observations in Hindsight (2023)
- Posterior Sampling-Based Online Learning for Episodic POMDPs (2023)
- Near-Optimal Partially Observable Reinforcement Learning with Partial Online State Information (2023)
- Robust Asymmetric Learning in POMDPs (2020)
- A Spectral Approach to Off-Policy Evaluation for POMDPs (2021)
- Sequential Monte Carlo for Policy Optimization in Continuous POMDPs (2025)