Learning Non-Markovian Reward Models In MDPs
2020 · Gavin Rens, Jean-François Raskin
Abstract
There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks. In other words, the reward that the agent receives is non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine: a finite-state automaton that produces output sequences (rewards in our case) from input sequences (state/action observations in our case). In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, and a Mealy machine synchronised with this MDP to formalise the non-Markovian reward function. While the MDP is known to the agent, the reward function is unknown to the agent and must be learnt. Learning non-Markovian reward functions is a challenge. Our approach to overcoming this problem is a careful combination of Angluin's L* active learning algorithm to learn finite automata, testing techniques for establishing
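To make the idea of a Mealy machine synchronised with an MDP concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the class name `MealyRewardMachine`, the node names `q0`/`q1`, and the key-and-door task are all hypothetical. It also assumes that an observation with no listed transition leaves the machine in its current node and yields a default reward of zero (a textbook Mealy machine would define every transition explicitly).

```python
from typing import Dict, Iterable, List, Tuple

# A (state, action) observation emitted by the MDP at each step.
Observation = Tuple[str, str]


class MealyRewardMachine:
    """Hypothetical Mealy machine mapping observation sequences to rewards."""

    def __init__(
        self,
        initial: str,
        transitions: Dict[Tuple[str, Observation], Tuple[str, float]],
        default_reward: float = 0.0,
    ) -> None:
        self.initial = initial
        # (node, observation) -> (next node, reward emitted on that step)
        self.transitions = transitions
        self.default_reward = default_reward

    def run(self, observations: Iterable[Observation]) -> List[float]:
        """Return the reward emitted for each observation, in order."""
        node = self.initial
        rewards: List[float] = []
        for obs in observations:
            # Assumption: unlisted transitions self-loop with the default reward.
            node, reward = self.transitions.get(
                (node, obs), (node, self.default_reward)
            )
            rewards.append(reward)
        return rewards


# Hypothetical task: opening the door pays off only after picking up the key.
machine = MealyRewardMachine(
    initial="q0",
    transitions={
        ("q0", ("key_room", "pickup")): ("q1", 0.0),
        ("q1", ("door", "open")): ("q0", 1.0),
    },
)

trace = [("door", "open"), ("key_room", "pickup"), ("door", "open")]
rewards = machine.run(trace)
print(rewards)
```

In this trace the same observation `("door", "open")` appears twice but is rewarded only the second time, after the key has been picked up. The reward therefore depends on the history of observations and not on the current MDP state and action alone, which is what makes it non-Markovian.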
Related papers
- Inferring Probabilistic Reward Machines From Non-Markovian Reward Processes For Reinforcement Learning (2021)
- Reinforcement Learning In Reward-mixing MDPs (2021)
- Policy Dispersion In Non-Markovian Environment (2023)
- Reward Is Enough For Convex MDPs (2021)
- Model-based Exploration In Monitored Markov Decision Processes (2025)
- Inverse Reinforcement Learning In Contextual MDPs (2019)
- Learning Task Automata For Reinforcement Learning Using Hidden Markov Models (2022)
- Decentralized Graph-based Multi-agent Reinforcement Learning Using Reward Machines (2021)