The Best Of Both Worlds: Reinforcement Learning With Logarithmic Regret And Policy Switches
2022 Β· Grigoris Velegkas, Zhuoran Yang, Amin Karbasi
Abstract
In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon \(T\), provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with \(O(log T)\) switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our results with lower bounds which show that even in the tabular setting, we cannot hope for regret guar
Authors
(none)
Tags
Stats
Related papers
- Logarithmic Regret For Episodic Continuous-time Linear-quadratic Reinforcement Learning Over A Finite-time Horizon (2020)7.81
- Sublinear Regret For A Class Of Continuous-time Linear-quadratic Reinforcement Learning Problems (2024)0.00
- Model-based Reinforcement Learning With Multinomial Logistic Function Approximation (2022)2.26
- Logarithmic Regret Of Exploration In Average Reward Markov Decision Processes (2025)0.00
- Model-based Reinforcement Learning With Double Oracle Efficiency In Policy Optimization And Offline Estimation (2026)0.00
- Breaking The Sample Complexity Barrier To Regret-optimal Model-free Reinforcement Learning (2021)0.00
- Regret-optimal Q-learning With Low Cost For Single-agent And Federated Reinforcement Learning (2025)0.00
- Online Reinforcement Learning In Markov Decision Process Using Linear Programming (2023)3.58