Contextual Bandits And Optimistically Universal Learning
2022 · Moise Blanchard, Steve Hanneke, Patrick Jaillet
Abstract
We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency -- vanishing regret compared to the optimal policy -- and show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism, a property known as universal consistency. Precisely, we first give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Second, we show that there always exists an algorithm that guarantees universal consistency whenever this is achievable, called an optimistically universal learning rule. Interestingly, for finite action spaces, learnable processes for universal learnin
Related papers
- Anytime-valid Off-policy Inference For Contextual Bandits (2022) 2.26
- Inverse Contextual Bandits Without Rewards: Learning From A Non-stationary Learner Via Suffix Imitation (2026) 0.00
- Learning Without Knowing: Unobserved Context In Continuous Transfer Reinforcement Learning (2021) 0.00
- A New Bandit Setting Balancing Information From State Evolution And Corrupted Context (2020) 0.00
- Bandit Social Learning: Exploration Under Myopic Behavior (2023) 0.00
- Local Metric Learning For Off-policy Evaluation In Contextual Bandits With Continuous Actions (2022) 0.00
- Multi-action Restless Bandits With Weakly Coupled Constraints: Simultaneous Learning And Control (2024) 0.00
- Unified Models Of Human Behavioral Agents In Bandits, Contextual Bandits And RL (2020) 8.35