Low-switching Policy Gradient With Exploration Via Online Sensitivity Sampling
2023 · Yunfan Li, Yiran Wang, Yu Cheng, et al.
Abstract
Policy optimization methods are powerful algorithms in Reinforcement Learning (RL) for their flexibility to deal with policy parameterization and ability to handle model misspecification. However, these methods usually suffer from slow convergence rates and poor sample complexity. Hence it is important to design provably sample efficient algorithms for policy optimization. Yet, recent advances for this problem have only been successful in tabular and linear settings, whose benign structures cannot be generalized to non-linearly parameterized policies. In this paper, we address this problem by leveraging recent advances in value-based algorithms, including bounded eluder-dimension and online sensitivity sampling, to design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation. We show that our algorithm obtains an \(\epsilon\)-optimal policy with only \(\widetilde{O}(\frac{\text{poly}(d)}{\epsilon^3})\) samples, wher
Related papers
- Cautiously Optimistic Policy Optimization And Exploration With Linear Function Approximation (2021)
- Optimistic Natural Policy Gradient: A Simple Efficient Policy Optimization Framework For Online RL (2023)
- Conservative Optimistic Policy Optimization Via Multiple Importance Sampling (2021)
- Provably Efficient Exploration In Policy Optimization (2019)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)
- Policy Optimization As Online Learning With Mediator Feedback (2020)
- Proximal Policy Optimization Algorithms (2017)
- On The Sample Complexity Of Differentially Private Policy Optimization (2025)