Model-based Exploration In Monitored Markov Decision Processes
2025 · Alireza Kazemipour, Simone Parisi, Matthew E. Taylor, et al.
Abstract
A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for 'unsolvable' Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that observable rewards are reliably captured, and another to learn the minimax-optimal policy. Second, we empiric
Related papers
- Active Exploration In Markov Decision Processes (2019)
- Conservative Exploration In Reinforcement Learning (2020)
- Non-stationary Markov Decision Processes, A Worst-case Approach Using Model-based Reinforcement Learning, Extended Version (2019)
- Reinforcement Learning In Reward-mixing MDPs (2021)
- An Analysis Of Model-based Reinforcement Learning From Abstracted Observations (2022)
- Optimal Decision-making In Mixed-agent Partially Observable Stochastic Environments Via Reinforcement Learning (2019)
- Model-free Reinforcement Learning In Infinite-horizon Average-reward Markov Decision Processes (2019)
- Learning Non-Markovian Reward Models In MDPs (2020)