Abstract

This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider \(\gamma\)-discounted infinite-horizon Markov decision processes (MDPs) with state space \(\mathcal\{S\}\) and action space \(\mathcal\{A\}\). Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least \(\frac\{|\mathcal\{S\}||\mathcal\{A\}|\}\{(1-\gamma)^2\}\). The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of \(\frac\{|\mathcal\{S\}||\mathcal\{A\}|\}\{1-\gamma\}\) (modulo some log fa

Authors

(none)

Tags

  • Model-Based RL

Stats

  • citations15
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score9.03
  • arxiv keyli2020breaking

Related papers