Regret Bounds For Reinforcement Learning Via Markov Chain Concentration
2018 Β· Ronald Ortner
Abstract
We give a simple optimistic algorithm for which it is easy to derive regret bounds of \(\tilde\{O\}(\sqrt\{t_\{\rm mix\} SAT\})\) after \(T\) steps in uniformly ergodic Markov decision processes with \(S\) states, \(A\) actions, and mixing time parameter \(t_\{\rm mix\}\). These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
Authors
(none)
Tags
Stats
Related papers
- Minimax Regret Bounds For Reinforcement Learning (2017)0.00
- Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function (2019)0.00
- Posterior Sampling For Reinforcement Learning: Worst-case Regret Bounds (2017)0.00
- Square-root Regret Bounds For Continuous-time Episodic Markov Decision Processes (2022)2.26
- Optimistic Posterior Sampling For Reinforcement Learning With Few Samples And Tight Guarantees (2022)0.00
- Breaking The Sample Complexity Barrier To Regret-optimal Model-free Reinforcement Learning (2021)0.00
- Logarithmic Regret Bounds For Continuous-time Average-reward Markov Decision Processes (2022)5.24
- Efficient Exploration In Average-reward Constrained Reinforcement Learning: Achieving Near-optimal Regret With Posterior Sampling (2024)0.00