Tiered Reinforcement Learning: Pessimism In The Face Of Uncertainty And Constant Regret
2022 · Jiawei Huang, Li Zhao, Tao Qin, et al.
Abstract
We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies \(\pi^{\text{O}}\) and \(\pi^{\text{E}}\): \(\pi^{\text{O}}\) ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while \(\pi^{\text{E}}\) ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., \(\pi^{\text{E}}=\pi^{\text{O}}\)) for the risk-averse users. We individually consider the gap-independent vs. gap-dependent settings. For the former, we prove that the separati
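The two-policy setup described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm. It assumes a Bernoulli multi-armed bandit, lets \(\pi^{\text{O}}\) pick the arm with the highest upper confidence bound, and lets \(\pi^{\text{E}}\) pick the arm with the highest lower confidence bound (one simple way to read "pessimism" in the title). The function names `confidence_width` and `run_tiered_bandit`, the Hoeffding-style width, and the choice to log only the rewards observed by \(\pi^{\text{O}}\) are all assumptions made for illustration.

```python
import math
import random


def confidence_width(count, t):
    """Hoeffding-style confidence width (an illustrative choice)."""
    if count == 0:
        return math.inf  # arms never pulled are maximally uncertain
    return math.sqrt(2.0 * math.log(t + 1) / count)


def run_tiered_bandit(means, horizon, seed=0):
    """Simulate a two-tier Bernoulli bandit under the stated assumptions.

    pi_O acts optimistically (highest upper confidence bound).
    pi_E acts pessimistically (highest lower confidence bound) and
    relies only on the data that pi_O has collected so far.
    Returns (regret of pi_O, regret of pi_E, pull counts of pi_O).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # pulls per arm made by pi_O
    sums = [0.0] * n_arms   # total reward per arm observed by pi_O
    best = max(means)
    regret_o = 0.0
    regret_e = 0.0

    for t in range(1, horizon + 1):
        est = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]
        width = [confidence_width(counts[a], t) for a in range(n_arms)]

        # Optimism for the risk-tolerant tier, pessimism for the risk-averse tier.
        arm_o = max(range(n_arms), key=lambda a: est[a] + width[a])
        arm_e = max(range(n_arms), key=lambda a: est[a] - width[a])

        # Expected (pseudo-)regret of each policy at this step.
        regret_o += best - means[arm_o]
        regret_e += best - means[arm_e]

        # Assumption: only the reward seen by pi_O is added to the dataset.
        reward = 1.0 if rng.random() < means[arm_o] else 0.0
        counts[arm_o] += 1
        sums[arm_o] += reward

    return regret_o, regret_e, counts


if __name__ == "__main__":
    r_o, r_e, pulls = run_tiered_bandit([0.2, 0.5, 0.8], horizon=5000, seed=0)
    print(f"regret pi_O: {r_o:.1f}, regret pi_E: {r_e:.1f}, pulls: {pulls}")
```

Running the script prints the cumulative pseudo-regret of both policies, which gives a rough feel for the question the abstract raises: whether the exploit-only policy does better for risk-averse users than following \(\pi^{\text{O}}\) directly. A single seeded run is only an illustration and says nothing about the paper's formal bounds.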
Related papers
- Accommodating Picky Customers: Regret Bound And Exploration Complexity For Multi-objective Reinforcement Learning (2020)
- Epistemic Risk-sensitive Reinforcement Learning (2019)
- Local Differential Privacy For Regret Minimization In Reinforcement Learning (2020)
- Optimistic Policy Learning Under Pessimistic Adversaries With Regret And Violation Guarantees (2026)
- Reinforcement Learning With Human Feedback: Learning Dynamic Choices Via Pessimism (2023)
- DOPE: Doubly Optimistic And Pessimistic Exploration For Safe Reinforcement Learning (2021)
- The Best Of Both Worlds: Reinforcement Learning With Logarithmic Regret And Policy Switches (2022)
- Bridging Offline Reinforcement Learning And Imitation Learning: A Tale Of Pessimism (2021)