Abstract

This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with \(S\) states, \(A\) actions, and horizon length \(H\), and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin\{align*\} \frac\{SAH^3\}\{\epsilon^2\} \text\{ sample episodes (up to log factor)\} \end\{align*\} without guidance of the reward information, our algorithm is able to find \(\epsilon\)-optimal policies for all these reward functions, provided that \(\epsilon\) is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds \(\frac\{S^2AH^3\}\{\epsilon^2\}\) episode

Authors

(none)

Tags

  • Exploration

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyli2023minimax

Related papers