Optimal Convergence Rate For Exact Policy Mirror Descent In Discounted Markov Decision Processes
2023 Β· Emmeran Johnson, Ciara Pike-Burke, Patrick Rebeschini
Abstract
Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor \(\gamma\) of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free \(\gamma\)-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the \(\gamma\)-rate is optimal for PMD methods as well as PI, and that the adaptive step-size is necessary for PMD to achieve it. Our work is the first to relate PMD to rate-optimality and step-size
Authors
(none)
Tags
Stats
Related papers
- Policy Mirror Descent With Temporal Difference Learning: Sample Complexity Under Online Markov Data (2025)0.00
- Policy Optimization For Constrained Mdps With Provable Fast Global Convergence (2021)0.00
- Learning Mirror Maps In Policy Mirror Descent (2024)0.00
- Mirror Descent Policy Optimisation For Robust Constrained Markov Decision Processes (2025)0.00
- Policy Gradient For Robust Markov Decision Processes (2024)0.00
- On The Linear Convergence Of Natural Policy Gradient Algorithm (2021)0.00
- Stochastic First-order Methods For Average-reward Markov Decision Processes (2022)3.58
- Policy Mirror Descent Inherently Explores Action Space (2023)2.26