Strongly-polynomial Time And Validation Analysis Of Policy Gradient Methods
2024 Β· Caleb Ju, Guanghui Lan
Abstract
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly-polynomial time. To the best of our knowledge, this is the first time that such strong convergence properties have been established for policy gradient methods. Moreover, in the stochastic setting, where only stochastic estimates of policy gradients are available, we show that the advantage gap function provides close approximations of the optimality gap for each individual state and exhibits a sublinear rate of convergence at every state. The advantage gap function can be easily estimated in the stochastic case, and when coupled with eas
Authors
(none)
Tags
Stats
Related papers
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)0.00
- On The Linear Convergence Of Natural Policy Gradient Algorithm (2021)0.00
- Convergence And Optimality Of Policy Gradient Methods In Weakly Smooth Settings (2021)3.58
- Global Convergence Of Policy Gradient Methods In Reinforcement Learning, Games And Control (2023)0.00
- Deterministic Policy Gradient For Reinforcement Learning With Continuous Time And State (2025)0.00
- Policy Gradient In Partially Observable Environments: Approximation And Convergence (2018)0.00
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)0.00
- A Study Of Policy Gradient On A Class Of Exactly Solvable Models (2020)0.00