Abstract

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups based on their tolerance for exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies \(\pi^{\text{O}}\) and \(\pi^{\text{E}}\): \(\pi^{\text{O}}\) ("O" for "online") interacts with the more risk-tolerant users in the first tier and minimizes regret by balancing exploration and exploitation as usual, while \(\pi^{\text{E}}\) ("E" for "exploit") focuses exclusively on exploitation for the risk-averse users in the second tier, using the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., \(\pi^{\text{E}}=\pi^{\text{O}}\)) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separati
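The two-policy setup described above can be illustrated with a minimal sketch on a Bernoulli multi-armed bandit. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `run_tiered_bandit`, the choice of UCB1 for \(\pi^{\text{O}}\), the choice of a greedy empirical-mean rule for \(\pi^{\text{E}}\), and the convention that only \(\pi^{\text{O}}\)'s interactions are logged.

```python
import math
import random


def run_tiered_bandit(means, horizon, seed=0):
    """Simulate a hypothetical two-tier Bernoulli bandit.

    Returns (regret_o, regret_e): cumulative pseudo-regret of the online
    policy pi^O and of the exploit-only policy pi^E.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # pulls per arm, logged from pi^O only (assumption)
    sums = [0.0] * k    # summed rewards per arm, logged from pi^O only
    best = max(means)
    regret_o = 0.0
    regret_e = 0.0

    for t in range(1, horizon + 1):
        # pi^E: purely exploit the data collected before this round.
        pulled = [a for a in range(k) if counts[a] > 0]
        if pulled:
            arm_e = max(pulled, key=lambda a: sums[a] / counts[a])
        else:
            arm_e = 0  # no data yet, so pick an arbitrary arm
        regret_e += best - means[arm_e]

        # pi^O: UCB1 (an illustrative choice), pulling each arm once first.
        if t <= k:
            arm_o = t - 1
        else:
            arm_o = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm_o] else 0.0
        counts[arm_o] += 1
        sums[arm_o] += reward
        regret_o += best - means[arm_o]

    return regret_o, regret_e


if __name__ == "__main__":
    r_o, r_e = run_tiered_bandit([0.2, 0.5, 0.8], horizon=5000)
    print(f"pi^O regret: {r_o:.1f}, pi^E regret: {r_e:.1f}")
```

Comparing the two returned values on a given instance shows how the regret experienced by the second tier can differ from that of the first, though this toy rule for \(\pi^{\text{E}}\) carries none of the guarantees discussed in the abstract.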

Authors

(none)

Tags

  • Exploration

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: huang2022tiered

Related papers