Learning And Planning In Average-reward Markov Decision Processes
2020 · Yi Wan, Abhishek Naik, Richard S. Sutton
Abstract
We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
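The central idea in the abstract, updating the average-reward estimate with the temporal-difference error instead of the conventional error (reward minus current estimate), can be illustrated with a minimal tabular sketch in the style of the paper's Differential Q-learning. The function names, the epsilon-greedy behaviour policy, the step sizes `alpha` and `eta`, and the two-state `toy_step` environment are all assumptions made for illustration and are not taken from the text above.

```python
import random


def differential_q_learning(step, n_states, n_actions, n_steps=20000,
                            alpha=0.1, eta=1.0, epsilon=0.1, seed=0):
    """Tabular sketch: returns (q, r_bar) after n_steps of interaction."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    r_bar = 0.0  # running estimate of the average reward
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy behaviour policy (an assumption; any policy that
        # keeps visiting all state-action pairs would do for this sketch).
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: q[s][i])
        r, s_next = step(s, a)
        # Differential TD error: no discounting, reward is centred by r_bar.
        delta = r - r_bar + max(q[s_next]) - q[s][a]
        q[s][a] += alpha * delta
        # Average-reward estimate moves along the TD error, not (r - r_bar).
        r_bar += eta * alpha * delta
        s = s_next
    return q, r_bar


def toy_step(s, a):
    """Hypothetical 2-state MDP: action 0 stays, action 1 switches state;
    reward is 1 when the next state is state 1, else 0."""
    s_next = s if a == 0 else 1 - s
    return (1.0 if s_next == 1 else 0.0), s_next
```

Calling `differential_q_learning(toy_step, n_states=2, n_actions=2)` on this toy problem should drive `r_bar` toward the optimal reward rate of 1 (reach state 1 and stay there). No reference state appears anywhere in the update, which is the property the abstract emphasizes.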
Related papers
- Planning And Learning In Average Risk-aware MDPs (2025)
- On Convergence Of Average-reward Off-policy Control Algorithms In Weakly Communicating MDPs (2022)
- Stochastic First-order Methods For Average-reward Markov Decision Processes (2022)
- Sharper Model-free Reinforcement Learning For Average-reward Markov Decision Processes (2023)
- Model-free Reinforcement Learning In Infinite-horizon Average-reward Markov Decision Processes (2019)
- On Learning History Based Policies For Controlling Markov Decision Processes (2022)
- Reinforcement Learning For Infinite-horizon Average-reward Linear MDPs Via Approximation By Discounted-reward MDPs (2024)
- A General Markov Decision Process Framework For Directly Learning Optimal Control Policies (2019)