The Fallacy Of Minimizing Cumulative Regret In The Sequential Task Setting
2024 · Ziping Xu, Kelly W. Zhang, Susan A. Murphy
Abstract
Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, and the data collected in the first task is used to warm-start the second task. The performance of the warm-start policy is measured by simple regret (SR). While minimizing CR and minimizing SR are generally conflicting objectives, previous research has shown that in stationary environments, both can be optimized in terms of the duration of the task, \(T\). In real-world applications, however, human-in-the-loop decisions between tasks often result in non-stationarity. For instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR. To balance these competing goals, the algorithm must explore excessively, leading
Related papers
- Model-free Online Learning In Unknown Sequential Decision Making Problems And Games (2021)
- Reinforcement Learning Algorithms For Regret Minimization In Structured Markov Decision Processes (2016)
- Enabling Realtime Reinforcement Learning At Scale With Staggered Asynchronous Inference (2024)
- A Reduction From Reinforcement Learning To No-regret Online Learning (2019)
- The Best Of Both Worlds: Reinforcement Learning With Logarithmic Regret And Policy Switches (2022)
- Breaking The Sample Complexity Barrier To Regret-optimal Model-free Reinforcement Learning (2021)
- Online Reinforcement Learning In Non-stationary Context-driven Environments (2023)
- Conservative Optimistic Policy Optimization Via Multiple Importance Sampling (2021)