Softmax Policy Gradient Methods Can Take Exponential Time To Converge
2021 Β· Gen Li, Yuting Wei, Yuejie Chi, et al.
Abstract
The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For \(\gamma\)-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space \(\mathcal\{S\}\) and the effective horizon \(\frac\{1\}\{1-\gamma\}\), both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize \(\eta\) can take \[ \frac\{1\}\{\eta\} |\mathcal\{S\}|
Authors
(none)
Tags
Stats
Related papers
- On The Global Convergence Rates Of Softmax Policy Gradient Methods (2020)0.00
- An Alternate Policy Gradient Estimator For Softmax Policies (2021)0.00
- Elementary Analysis Of Policy Gradient Methods (2024)0.00
- Rethinking The Global Convergence Of Softmax Policy Gradient With Linear Function Approximation (2025)0.00
- On The Global Convergence Rates Of Decentralized Softmax Gradient Play In Markov Potential Games (2022)0.00
- Fast Global Convergence Of Natural Policy Gradient Methods With Entropy Regularization (2020)0.00
- Convergence And Price Of Anarchy Guarantees Of The Softmax Policy Gradient In Markov Potential Games (2022)0.00
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)0.00