Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning

Abstract

This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with \(S\) states, \(A\) actions, and horizon length \(H\), and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin\{align*\} \frac\{SAH^3\}\{\epsilon^2\} \text\{ sample episodes (up to log factor)\} \end\{align*\} without guidance of the reward information, our algorithm is able to find \(\epsilon\)-optimal policies for all these reward functions, provided that \(\epsilon\) is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds \(\frac\{S^2AH^3\}\{\epsilon^2\}\) episode

Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning

Abstract

Authors

Tags

Stats

Related papers