Abstract

In this paper we consider the problem of learning an \(\epsilon\)-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with \(S\) states, \(A\) actions, the discount factor \(\gamma \in (0,1)\), and an approximation threshold \(\epsilon > 0\), we provide a model-free algorithm to learn an \(\epsilon\)-optimal policy with sample complexity \(\tilde\{O\}(\frac\{SA\ln(1/p)\}\{\epsilon^2(1-\gamma)^\{5.5\}\})\) (where the notation \(\tilde\{O\}(\cdot)\) hides poly-logarithmic factors of \(S,A,1/(1-\gamma)\), and \(1/\epsilon\)) and success probability \((1-p)\). For small enough \(\epsilon\), we show an improved algorithm with sample complexity \(\tilde\{O\}(\frac\{SA\ln(1/p)\}\{\epsilon^2(1-\gamma)^\{3\}\})\). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on \(S\), our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.

Authors

(none)

Tags

  • Model-Based RL

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyzhang2020model

Related papers