Last-iterate Convergence of General Parameterized Policies in Constrained MDPs

Abstract

arXiv:2408.11513v2

This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, \(\epsilon_{\mathrm{bias}}\), PDR-ANPG achieves a last-iterate \(\epsilon\) optimality gap and \(\epsilon\) constraint violation with a sample complexity of \(\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})\). If the class is incomplete (\(\epsilon_{\mathrm{bias}}>0\)), then the sample complexity reduces to \(\tilde{\mathcal{O}}(\epsilon^{-2})\) for \(\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}\). Moreover, for complete policies with \(\epsilon_{\mathrm{bias}}=0\), our algorit
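
The threshold \(\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}\) can be read off the \(\min\) in the stated bound. The short derivation below is a sketch that uses only the formula quoted in the abstract; reading the reduced rate as \(\tilde{\mathcal{O}}(\epsilon^{-2})\) assumes that \(\epsilon_{\mathrm{bias}}\) is treated as a fixed constant of the policy class, which is an interpretation rather than something the abstract states explicitly.

\[
\epsilon_{\mathrm{bias}}^{-\frac{1}{3}} < \epsilon^{-2}
\;\Longleftrightarrow\;
\epsilon^{2} < \epsilon_{\mathrm{bias}}^{\frac{1}{3}}
\;\Longleftrightarrow\;
\epsilon < \epsilon_{\mathrm{bias}}^{\frac{1}{6}},
\]
\[
\text{so for } \epsilon < \epsilon_{\mathrm{bias}}^{\frac{1}{6}}:\qquad
\tilde{\mathcal{O}}\!\left(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\}\right)
= \tilde{\mathcal{O}}\!\left(\epsilon^{-2}\,\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\right).
\]

As a purely illustrative number (not taken from the paper), \(\epsilon_{\mathrm{bias}}=10^{-6}\) gives a threshold of \(\epsilon<10^{-1}\) and a factor \(\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}=10^{2}\) multiplying \(\epsilon^{-2}\).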
