Policy Optimization With Stochastic Mirror Descent
2019 · Long Yang, Yu Zhang, Gang Zheng, et al.
Abstract
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the \(\mathtt{VRMPO}\) algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In \(\mathtt{VRMPO}\), a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed \(\mathtt{VRMPO}\) needs only \(\mathcal{O}(\epsilon^{-3})\) sample trajectories to achieve an \(\epsilon\)-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive experimental results demonstrate that \(\mathtt{VRMPO}\) outperforms state-of-the-art policy gradient methods in various settings.
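As a rough illustration of the mirror descent idea the abstract refers to, the sketch below runs a mirror ascent update on a probability simplex. It is not the \(\mathtt{VRMPO}\) algorithm: the abstract does not specify the mirror map or the variance-reduced estimator, so the negative-entropy mirror map, the 3-armed bandit, the reward values, and the function name `mirror_descent_step` are all assumptions made for illustration. The exact gradient is used here where a stochastic method would use a sampled estimate.

```python
import math

def mirror_descent_step(policy, grad, step_size):
    """One mirror ascent step on the probability simplex.

    Assumes the negative-entropy mirror map, for which the update has the
    closed form new_p[i] proportional to p[i] * exp(step_size * grad[i]).
    """
    exponents = [step_size * g for g in grad]
    # Subtract the largest exponent for numerical stability;
    # the shift cancels when the weights are normalised.
    shift = max(exponents)
    weights = [p * math.exp(e - shift) for p, e in zip(policy, exponents)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 3-armed bandit: expected reward per action (illustrative values).
rewards = [0.1, 0.5, 0.9]
policy = [1.0 / 3] * 3  # start from the uniform policy

for _ in range(50):
    # The gradient of the expected reward sum_i p[i] * r[i] with respect to p
    # is the reward vector itself; a stochastic method would replace this
    # with an estimate built from sampled trajectories.
    policy = mirror_descent_step(policy, rewards, step_size=0.5)
```

With this mirror map the iterate stays on the simplex without an explicit projection, and probability mass concentrates on the highest-reward action as the steps accumulate.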
Related papers
- Bregman Gradient Policy Optimization (2021)
- Mirror Descent Policy Optimisation For Robust Constrained Markov Decision Processes (2025)
- Sample Efficient Policy Gradient Methods With Recursive Variance Reduction (2019)
- Stochastic Variance Reduction For Policy Gradient Estimation (2017)
- Policy Mirror Descent With Temporal Difference Learning: Sample Complexity Under Online Markov Data (2025)
- Policy Mirror Descent Inherently Explores Action Space (2023)
- Proximal Policy Optimization Algorithms (2017)
- An Improved Convergence Analysis Of Stochastic Variance-reduced Policy Gradient (2019)