Model-free Reinforcement Learning: From Clipped Pseudo-regret To Sample Complexity

Abstract

In this paper we consider the problem of learning an \(\epsilon\)-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with \(S\) states, \(A\) actions, the discount factor \(\gamma \in (0,1)\), and an approximation threshold \(\epsilon > 0\), we provide a model-free algorithm to learn an \(\epsilon\)-optimal policy with sample complexity \(\tilde{O}\left(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}}\right)\) (where the notation \(\tilde{O}(\cdot)\) hides poly-logarithmic factors of \(S,A,1/(1-\gamma)\), and \(1/\epsilon\)) and success probability \((1-p)\). For small enough \(\epsilon\), we show an improved algorithm with sample complexity \(\tilde{O}\left(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}}\right)\). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on \(S\), our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.
