Dynamic Regret Of Online Markov Decision Processes
2022 · Peng Zhao, Long-Fei Li, Zhi-Hua Zhou
Abstract
We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. This measure is strictly stronger than the standard static regret, which benchmarks the learner's performance against a fixed comparator policy. We consider three foundational models of online MDPs: episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For these three models, we propose novel online ensemble algorithms and establish their respective dynamic regret guarantees; the results for episodic (loop-free) SSP are provably minimax optimal in terms of the time horizon and a certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms and achieve better dynamic regret bounds.
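To make the distinction between the two measures concrete, the following is a minimal sketch in a deliberately simplified setting: a finite set of candidate policies whose per-episode losses are given as a table. The function names, the table layout, and the toy numbers are illustrative assumptions and do not come from the paper; the sketch only computes the regret quantities and is not one of the proposed algorithms.

```python
from typing import Sequence


def static_regret(learner_losses: Sequence[float],
                  policy_losses: Sequence[Sequence[float]]) -> float:
    """Learner's total loss minus that of the best single fixed policy.

    policy_losses[k][i] is the (hypothetical) loss of policy i in episode k.
    """
    n_policies = len(policy_losses[0])
    best_fixed = min(
        sum(row[i] for row in policy_losses) for i in range(n_policies)
    )
    return sum(learner_losses) - best_fixed


def dynamic_regret(learner_losses: Sequence[float],
                   policy_losses: Sequence[Sequence[float]],
                   comparators: Sequence[int]) -> float:
    """Learner's total loss minus that of a changing comparator sequence.

    comparators[k] is the index of the policy used as benchmark in episode k.
    """
    comparator_total = sum(
        policy_losses[k][i] for k, i in enumerate(comparators)
    )
    return sum(learner_losses) - comparator_total


# Toy data (assumed): two policies whose losses alternate between episodes.
policy_losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
learner_losses = [0.5, 0.5, 0.5, 0.5]

static = static_regret(learner_losses, policy_losses)
dynamic = dynamic_regret(learner_losses, policy_losses, [0, 1, 0, 1])
```

In this toy instance the learner matches the best fixed policy, so its static regret is zero, yet a comparator sequence that switches every episode incurs no loss at all, so the dynamic regret against that sequence is positive. Choosing a constant comparator sequence recovers the fixed-policy benchmark, which is why dynamic regret is the stronger measure.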
Related papers
- Online Convex Optimization In Adversarial Markov Decision Processes (2019)
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)
- Learning Adversarial Markov Decision Processes With Delayed Feedback (2020)
- Online Markov Decision Processes With Aggregate Bandit Feedback (2021)
- Near-optimal Regret Using Policy Optimization In Online MDPs With Aggregate Bandit Feedback (2025)
- Towards Optimal Regret In Adversarial Linear MDPs With Bandit Feedback (2023)
- Refined Regret For Adversarial MDPs With Linear Function Approximation (2023)
- Regret Analysis In Deterministic Reinforcement Learning (2021)