Policy Optimization With Stochastic Mirror Descent

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the \(\mathtt{VRMPO}\) algorithm, a sample-efficient policy gradient method based on stochastic mirror descent. \(\mathtt{VRMPO}\) introduces a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that \(\mathtt{VRMPO}\) needs only \(\mathcal{O}(\epsilon^{-3})\) sample trajectories to reach an \(\epsilon\)-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive experimental results demonstrate that \(\mathtt{VRMPO}\) outperforms state-of-the-art policy gradient methods in various settings.
