Optimistic Policy Optimization Is Provably Efficient In Non-stationary MDPs
2021 · Han Zhong, Zhongren Chen, Zhuoran Yang, et al.
Abstract
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the periodically restarted optimistic policy optimization algorithm (PROPO), an optimistic policy optimization algorithm with linear function approximation. PROPO features two mechanisms, sliding-window-based policy evaluation and periodic-restart-based policy improvement, both tailored for policy optimization in a non-stationary environment. In addition, using only the sliding-window technique, we propose a value-iteration algorithm. We establish dynamic upper bounds for the proposed methods and a minimax lower bound which shows the (ne
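The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the paper's PROPO algorithm: the optimism bonus is omitted, the multiplicative softmax policy update is an assumed stand-in for the policy improvement step, and every name below (`solve_linear`, `sliding_window_ridge`, `policy_step`, `restart_period`, `step_size`) is hypothetical.

```python
import math
from collections import deque


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def sliding_window_ridge(window, dim, lam=1.0):
    """Ridge regression using only the (feature, target) pairs in `window`.

    Older samples have already been dropped from the window, so the
    estimate tracks a drifting parameter instead of averaging over it.
    """
    A = [[lam if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for phi, y in window:
        for i in range(dim):
            b[i] += phi[i] * y
            for j in range(dim):
                A[i][j] += phi[i] * phi[j]
    return solve_linear(A, b)


def policy_step(policy, q_values, episode, restart_period, step_size=0.5):
    """Reset to uniform at each restart boundary; otherwise apply an
    assumed multiplicative (softmax-style) update toward higher Q-values."""
    n = len(policy)
    if episode % restart_period == 0:
        return [1.0 / n] * n
    weights = [p * math.exp(step_size * q) for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy drift: the true scalar parameter jumps from 0 to 2 halfway through.
window = deque(maxlen=10)  # keeps only the 10 most recent samples
for t in range(100):
    target = 0.0 if t < 50 else 2.0
    window.append(([1.0], target))
estimate = sliding_window_ridge(window, dim=1, lam=1e-3)
```

In this toy run the window holds only post-change samples, so `estimate[0]` lands close to 2, whereas a fit over all 100 samples would land near 1. The restart in `policy_step` plays the analogous role on the policy side by discarding what was accumulated under an outdated environment.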
Related papers
- A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes (2023)
- Provably Efficient Exploration In Policy Optimization (2019)
- Near-optimal Policy Optimization Algorithms For Learning Adversarial Linear Mixture MDPs (2021)
- Cautiously Optimistic Policy Optimization And Exploration With Linear Function Approximation (2021)
- Nearly Optimal Policy Optimization With Stable At Any Time Guarantee (2021)
- Efficient Learning In Non-stationary Linear Markov Decision Processes (2020)
- Optimistic Natural Policy Gradient: A Simple Efficient Policy Optimization Framework For Online RL (2023)
- Warm-up Free Policy Optimization: Improved Regret In Linear Markov Decision Processes (2024)