Logarithmic Regret Of Exploration In Average Reward Markov Decision Processes
2025 Β· Victor Boone, Bruno Gaujal
Abstract
In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) by another simple rule, that we call the Vanishing Multiplicative (VM) rule. When managing episodes with (VM), the algorithm's regret is, both in theory and in practice, as good if not better than with (DT), while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under (VM) than (DT) by making the regret of exploration logarithmi
Authors
(none)
Tags
Stats
Related papers
- Asymptotically Optimal Regret In Communicating Markov Decision Processes (2025)0.00
- Logarithmic Regret Bounds For Continuous-time Average-reward Markov Decision Processes (2022)5.24
- Square-root Regret Bounds For Continuous-time Episodic Markov Decision Processes (2022)2.26
- Reinforcement Learning Algorithms For Regret Minimization In Structured Markov Decision Processes (2016)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00
- \(\sqrt{n}\)-regret For Learning In Markov Decision Processes With Function Approximation And Low Bellman Rank (2019)0.00
- Optimism And Delays In Episodic Reinforcement Learning (2021)0.00