Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space \(\mathcal\{S\}\) and the action space \(\mathcal\{A\}\) are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with \(|\mathcal\{S\}|\times|\mathcal\{A\}|\), which can be prohibitively large when \(\mathcal\{S\}\) or \(\mathcal\{A\}\) is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp.\(~\)Q-learning) provably learns an \(\epsilon\)-optimal policy (resp.\(~\)Q-function) with high probability as soon as the sample size exceeds the order of \(\frac\{K\}\{(1-\gamma)^\{3\}\epsilon^\{2\}\}\) (resp.\(~\)\(\frac\{K\}\{(1-\gamma)^\{4\}\epsilon^\{2\}\}\)), up to some logarithmic factor. Here \(K\) is

Authors

(none)

Tags

  • Model-Based RL

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keywang2021sample

Related papers