Abstract

We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal \(\widetilde O (\sqrt K)\) regret where \(K\) denotes the number of episodes. Our work is the first to establish the optimal (w.r.t.~\(K\)) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal (w.r.t.~\(K\)) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keysherman2023rate

Related papers