Abstract

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is \(\propto\widetilde\{\mathcal\{O\}\}(1/\epsilon^2)\), where \(\epsilon>0\) is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, \(\rho>0\), to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only \(\widetilde\{\mathcal\{O\}\} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )\) episodes of exploration, and is able to obtain an \(\epsilon\)-optimal policy for a post-reveale

Authors

(none)

Tags

  • Exploration

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keywu2021gap

Related papers