Provably Efficient Exploration In Policy Optimization
2019 · Qi Cai, Zhuoran Yang, Chi Jin, et al.
Abstract
While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains unclear how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. The paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves \(\tilde{O}(\sqrt{d^2 H^3 T})\) regret. Here \(d\) is the feature dimension, \(H\) is the episode horizon, and \(T\) is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
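The abstract does not spell out the update rule, so the following is only a minimal sketch of what an "optimistic" proximal policy step could look like for a single state with finitely many actions. It assumes that optimism is implemented by adding an exploration bonus to the estimated action values, and that the proximal step is a KL-regularized update of the form pi_new(a) ∝ pi_prev(a) · exp(alpha · Q(a)). The function names, the bonus values, and the step size `alpha` are hypothetical and chosen for illustration; they are not taken from the paper. The last helper simply evaluates the \(\sqrt{d^2 H^3 T}\) scale of the regret bound, ignoring constants and logarithmic factors.

```python
import math


def optimistic_q(q_estimate, bonus, horizon):
    """Add an exploration bonus to each action value, then clip to [0, horizon].

    Assumption: optimism enters as an additive bonus; the bonus values
    themselves are supplied by the caller and are purely illustrative here.
    """
    return [min(max(q + b, 0.0), horizon) for q, b in zip(q_estimate, bonus)]


def proximal_policy_step(pi_prev, q_values, alpha):
    """KL-regularized update: pi_new(a) is proportional to pi_prev(a) * exp(alpha * q(a))."""
    # Subtracting the max before exponentiating avoids overflow and
    # does not change the normalized result.
    q_max = max(q_values)
    weights = [p * math.exp(alpha * (q - q_max)) for p, q in zip(pi_prev, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


def regret_bound_scale(d, H, T):
    """Evaluate sqrt(d^2 * H^3 * T), ignoring constants and log factors."""
    return math.sqrt(d**2 * H**3 * T)


# Two actions: the second has a lower estimate but a larger bonus,
# so the optimistic update shifts probability mass toward it.
pi = [0.5, 0.5]
q_opt = optimistic_q(q_estimate=[1.0, 0.8], bonus=[0.0, 0.5], horizon=2.0)
pi_new = proximal_policy_step(pi, q_opt, alpha=1.0)
```

In this toy setting the less-explored action ends up with the higher probability after one step, which is the qualitative behavior an exploration bonus is meant to produce. Because the regret scale grows as \(\sqrt{T}\), quadrupling the number of steps only doubles it.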
Related papers
- A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes (2023)
- Optimistic Policy Optimization Is Provably Efficient In Non-stationary MDPs (2021)
- Optimistic Natural Policy Gradient: A Simple Efficient Policy Optimization Framework For Online RL (2023)
- Cautiously Optimistic Policy Optimization And Exploration With Linear Function Approximation (2021)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)
- Nearly Optimal Policy Optimization With Stable At Any Time Guarantee (2021)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Policy Optimization With Model-based Explorations (2018)