Policy Gradient For Robust Markov Decision Processes
2024 · Qiuhao Wang, Shaohang Xu, Chin Pang Ho, et al.
Abstract
We develop a generic policy gradient method with the global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems due to their scalable and efficient nature, adapting these methods to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for the policy optimization with adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror Ascent (TMA). Additionally, we propose innovative parametric transition kernels for both discrete an
Related papers
- Mirror Descent Policy Optimisation For Robust Constrained Markov Decision Processes (2025)
- Policy Optimization For Constrained MDPs With Provable Fast Global Convergence (2021)
- Policy Gradient Method For Robust Reinforcement Learning (2022)
- Robust Lagrangian And Adversarial Policy Gradient For Robust Constrained Markov Decision Processes (2023)
- Optimal Convergence Rate For Exact Policy Mirror Descent In Discounted Markov Decision Processes (2023)
- Policy Mirror Descent With Temporal Difference Learning: Sample Complexity Under Online Markov Data (2025)
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)
- Stochastic First-order Methods For Average-reward Markov Decision Processes (2022)