Abstract

Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, and the data collected in the first task is used to warm-start the second task. The performance of the warm-start policy is measured by simple regret (SR). While minimizing both CR and SR is generally a conflicting objective, previous research has shown that in stationary environments, both can be optimized in terms of the duration of the task, \(T\). In practice, however, in real-world applications, human-in-the-loop decisions between tasks often results in non-stationarity. For instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR. To balance these competing goals, the algorithm must explore excessively, leading

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyxu2024the

Related papers