Abstract

Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the the tabular setting, it is well known that this is a more difficult problem than reward-aware (PAC) RL -- where the agent has access to the reward function during exploration -- with optimal sample complexities in the two settings differing by a factor of \(|\mathcal\{S\}|\), the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a \(d\)-dimensional linear MDP with sample complexity scaling as \(\widetilde\{\mathcal\{O\}\}(d^2 H^5/\epsilon^2)\). We then show a lower bound with matching dimension-dependence of \(Ξ©(d^2 H^2/\epsilon^2)\), which holds for the reward-aware RL setting. To our knowledge, our approach is the

Authors

(none)

Tags

  • Exploration

Stats

Related papers