Abstract

Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with \(S\) states, \(A\) actions and horizon length \(H\), substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of \(\sqrt\{H^2SAT\}\) (modulo log factors) with \(T\) the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., \(S^6A^4 \,\mathrm\{poly\}(H)\) for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity \(O(SAH)\), that achieves near-optimal regret as soon as the sample size exceeds the order of \(SA\,\mathrm\{poly\}(H)\). In terms of this sam

Authors

(none)

Tags

  • Exploration

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyli2021breaking

Related papers