Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the \(\mathtt{VRMPO}\) algorithm, a sample-efficient policy gradient method with stochastic mirror descent. \(\mathtt{VRMPO}\) uses a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that \(\mathtt{VRMPO}\) needs only \(\mathcal{O}(\epsilon^{-3})\) sample trajectories to achieve an \(\epsilon\)-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that \(\mathtt{VRMPO}\) outperforms state-of-the-art policy gradient methods in various settings.

Authors

(none)

Tags

  • Policy Gradient

Stats

  • citations: 9
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 7.50
  • arxiv key: yang2019policy

Related papers