Abstract

We give a simple optimistic algorithm for which it is easy to derive regret bounds of \(\tilde\{O\}(\sqrt\{t_\{\rm mix\} SAT\})\) after \(T\) steps in uniformly ergodic Markov decision processes with \(S\) states, \(A\) actions, and mixing time parameter \(t_\{\rm mix\}\). These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations15
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score9.03
  • arxiv keyortner2018regret

Related papers