Policy Optimization for Constrained MDPs with Provable Fast Global Convergence

Abstract

We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an \(\mathcal{O}(1/\sqrt{T})\) global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm, policy mirror descent-primal dual (PMD-PD), that provably achieves a faster \(\mathcal{O}(\log(T)/T)\) convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm uses a modified value function and performs natural policy gradient steps, which are equivalent to mirror descent steps with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach: one to the setting with zero constraint violation and one to the setting with sample-based estimation.
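
To make the primal and dual steps described above concrete, the following is a minimal sketch and not the PMD-PD algorithm itself. It assumes a tabular policy with full support and an already-computed action-value table; the function names `mirror_descent_step` and `dual_ascent_step` are hypothetical. The primal step is the standard closed form of a KL-regularized mirror descent update, which for tabular softmax policies coincides with a natural policy gradient step. The dual step is a generic projected ascent on a single multiplier; the modified value function and modified Lagrange multipliers used by PMD-PD are not reproduced here.

```python
import numpy as np


def mirror_descent_step(pi, q, eta):
    """One KL-regularized mirror descent step for a tabular policy.

    pi:  (S, A) array; each row is a distribution over actions (assumed > 0).
    q:   (S, A) array of action values to be maximized (assumed given).
    eta: positive step size.
    Returns the updated policy, pi_new(a|s) proportional to pi(a|s) * exp(eta * q(s, a)).
    """
    # Work in log space and subtract the row max for numerical stability.
    logits = np.log(pi) + eta * q
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


def dual_ascent_step(lam, violation, eta):
    """Generic projected dual ascent (an assumption, not the paper's update).

    violation > 0 means the constraint is currently violated.
    """
    return max(0.0, lam + eta * violation)
```

In a primal-dual loop, `q` would typically be the action values of a Lagrangian-weighted reward, and the two steps would alternate; how those quantities are modified to obtain the faster rate is specific to the paper.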
