Abstract

arXiv:2408.11513v2

This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, \(\epsilon_{\mathrm{bias}}\), PDR-ANPG achieves a last-iterate \(\epsilon\) optimality gap and \(\epsilon\) constraint violation with a sample complexity of \(\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})\). If the class is incomplete (\(\epsilon_{\mathrm{bias}}>0\)), then the sample complexity reduces to \(\tilde{\mathcal{O}}(\epsilon^{-2})\) for \(\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}\). Moreover, for complete policies with \(\epsilon_{\mathrm{bias}}=0\), our algorit

Authors

(none)

Tags

  • Policy Gradient

Stats

  • citations: 0
  • S2 citations: -
  • GitHub stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arXiv key: mondal2026last

Related papers