Abstract

While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especial in online RL where exploration is necessary. This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL. Optimistic NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For \(d\)-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an \(\epsilon\)-optimal policy within \(\tilde\{O\}(d^2/\epsilon^3)\) samples, which is the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence \(\tilde\{\Theta\}(d^2)\). It also improves over stat

Authors

(none)

Tags

  • Policy Gradient
  • Exploration

Stats

  • citations0
  • S2 citations
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyliu2023optimistic

Related papers