Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function

Abstract

We present an algorithm based on the *Optimism in the Face of Uncertainty* (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function \(h^\{*\}\), the proposed algorithm achieves a regret bound of \(\tilde\{O\}(\sqrt\{SAHT\})\)\footnote\{The symbol \(\tilde\{O\}\) means \(O\) with log factors ignored. \} for MDP with \(S\) states and \(A\) actions, in the case that an upper bound \(H\) on the span of \(h^\{*\}\), i.e., \(sp(h^\{*\})\) is known. This result outperforms the best previous regret bounds \(\tilde\{O\}(S\sqrt\{AHT\}) \)\citep\{fruit2019improved\} by a factor of \(\sqrt\{S\}\). Furthermore, this regret bound matches the lower bound of \(Ω(\sqrt\{SAHT\}) \)\citep\{jaksch2010near\} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of \(\tilde\{O\}(\sqrt\{SADT\})\) for MDPs

Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function

Abstract

Authors

Tags

Stats

Related papers