Nearly Optimal Policy Optimization With Stable At Any Time Guarantee
2021 Β· Tianhao Wu, Yunchang Yang, Han Zhong, et al.
Abstract
Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result of policy-based method in \citet\{shani2020optimistic\} is only \(\tilde\{O\}(\sqrt\{S^2AH^4K\})\) where \(S\) is the number of states, \(A\) is the number of actions, \(H\) is the horizon, and \(K\) is the number of episodes, and there is a \(\sqrt\{SH\}\) gap compared with the information theoretic lower bound \(\tilde\{Ξ©\}(\sqrt\{SAH^3K\})\). To bridge such a gap, we propose a novel algorithm Reference-based Policy Optimization with Stable at Any Time guarantee (\algnameacro), which features the property "Stable at Any Time". We prove that our algorithm achieves \(\tilde\{O\}(\sqrt\{SAH^3K\} + \sqrt\{AH^4K\})\) regret. When \(S > H\), our algorithm is minimax optimal when ignoring logarithmic factors. To
Authors
(none)
Tags
Stats
Related papers
- Provably Efficient Exploration In Policy Optimization (2019)0.00
- Optimistic Policy Optimization Is Provably Efficient In Non-stationary Mdps (2021)0.00
- A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes (2023)0.00
- Conservative Optimistic Policy Optimization Via Multiple Importance Sampling (2021)0.00
- Moments Matter:stabilizing Policy Optimization Using Return Distributions (2026)0.00
- Absolute Policy Optimization (2023)0.00
- Policy Optimization For Markov Games: Unified Framework And Faster Convergence (2022)0.00
- Cautiously Optimistic Policy Optimization And Exploration With Linear Function Approximation (2021)0.00