A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes
2023 Β· Han Zhong, Tong Zhang
Abstract
The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback, and establish a \(\tilde\{\mathcal\{O\}\}(d^\{3/4\}H^2K^\{3/4\})\) regret for it. Here \(d\) is the ambient dimension of linear MDPs, \(H\) is the length of each episode, and \(K\) is the number of episodes. Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information. Additionally, our algorithm design features a novel multi-batched up
Authors
(none)
Tags
Stats
Related papers
- Provably Efficient Exploration In Policy Optimization (2019)0.00
- Optimistic Policy Optimization Is Provably Efficient In Non-stationary Mdps (2021)0.00
- Truly Proximal Policy Optimization (2019)0.00
- KIPPO: Koopman-inspired Proximal Policy Optimization (2025)0.00
- Warm-up Free Policy Optimization: Improved Regret In Linear Markov Decision Processes (2024)0.00
- Proximal Policy Optimization Algorithms (2017)0.00
- Delay-adapted Policy Optimization And Improved Regret For Adversarial MDP With Delayed Bandit Feedback (2023)0.00
- Near-optimal Policy Optimization Algorithms For Learning Adversarial Linear Mixture Mdps (2021)0.00