Asymptotically Optimal Regret In Communicating Markov Decision Processes
2025 Β· Victor Boone
Abstract
In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process \(M\), our algorithm has regret \(K(M) log(T) + \mathrm\{o\}(log(T))\) where \(T\) is the number of learning steps and \(K(M)\) is the best possible constant. This algorithm works by explicitly tracking the constant \(K(M)\) to learn optimally, then balances the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function \(K(M)\) is discontinuous, which is a consequence challenge for our approach. To that end, we describe a regularization mechanism to estimate \(K(M)\) with arbitrary precision from empirical data.
Authors
(none)
Tags
Stats
Related papers
- Logarithmic Regret Of Exploration In Average Reward Markov Decision Processes (2025)0.00
- Regret Minimization And Convergence To Equilibria In General-sum Markov Games (2022)0.00
- Rate-optimal Policy Optimization For Linear Markov Decision Processes (2023)0.00
- Logarithmic Regret Bounds For Continuous-time Average-reward Markov Decision Processes (2022)5.24
- Efficient Exploration In Average-reward Constrained Reinforcement Learning: Achieving Near-optimal Regret With Posterior Sampling (2024)0.00
- Posterior Sampling For Reinforcement Learning: Worst-case Regret Bounds (2017)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58
- Dynamic Regret Of Online Markov Decision Processes (2022)0.00