A Closer Look At The Worst-case Behavior Of Multi-armed Bandit Algorithms
2021 Β· Anand Kalvit, Assaf Zeevi
Abstract
One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between mean rewards in the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapts to this gap: for a horizon of play n, it achieves optimal O(log n) regret in instances with "large" gaps, and a near-optimal O(\sqrt\{n log n\}) minimax regret when the gap can be arbitrarily "small." This paper provides new results on the arm-sampling behavior of UCB, leading to several important insights. Among these, it is shown that arm-sampling rates under UCB are asymptotically deterministic, regardless of the problem complexity. This discovery facilitates new sharp asymptotics and a novel alternative proof for the O(\sqrt\{n log n\}) minimax regret of UCB. Furthermore, the paper also provides the first complete process-level characterization of the MAB problem under U
Authors
(none)
Tags
Stats
Related papers
- A Frequency-domain Analysis Of The Multi-armed Bandit Problem: A New Perspective On The Exploration-exploitation Trade-off (2025)0.00
- Online Learning For Cooperative Multi-player Multi-armed Bandits (2021)5.24
- Stochastic Multi-armed Bandits With Limited Control Variates (2026)0.00
- One Good Source Is All You Need: Near-optimal Regret For Bandits Under Heterogeneous Noise (2026)0.00
- Non-stationary Latent Auto-regressive Bandits (2024)0.00
- Mean Field Equilibrium In Multi-armed Bandit Game With Continuous Reward (2021)0.00
- Unified Framework Of Distributional Regret In Multi-armed Bandits And Reinforcement Learning (2026)0.00
- Trading Off Rewards And Errors In Multi-armed Bandits (2026)0.00