Unified Framework Of Distributional Regret In Multi-armed Bandits And Reinforcement Learning

Abstract

arXiv:2605.05102v1

We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels \(\delta \in (0,1]\), thereby characterizing the regret distribution across the full range of \(\delta\). We present a simple UCBVI-style algorithm with exploration bonus \(\min\{c_{1,k}/N,\ c_{2,k}/\sqrt{N}\}\), where \(N\) denotes the visit count and \((c_{1,k}, c_{2,k})\) are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both m
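The exploration bonus described in the abstract, \(\min\{c_{1,k}/N,\ c_{2,k}/\sqrt{N}\}\), can be illustrated with a minimal sketch. The function name `exploration_bonus`, the argument names, and the numeric values below are assumptions made for illustration only; the abstract does not specify how \(c_{1,k}\) and \(c_{2,k}\) are chosen, so they are passed in as plain numbers for a single fixed \(k\).

```python
import math


def exploration_bonus(n: int, c1: float, c2: float) -> float:
    """Return min(c1 / n, c2 / sqrt(n)) for a visit count n >= 1.

    c1 and c2 stand in for the user-specified parameters (c_{1,k}, c_{2,k})
    at one fixed k; how they vary with k is not modeled here.
    """
    if n < 1:
        raise ValueError("visit count must be at least 1")
    return min(c1 / n, c2 / math.sqrt(n))


# Hypothetical parameter values, chosen only to show both regimes of the min.
c1, c2 = 4.0, 1.0
for n in (1, 4, 16, 64):
    print(n, exploration_bonus(n, c1, c2))
```

The two terms coincide at \(N = (c_1/c_2)^2\). For smaller visit counts the \(c_2/\sqrt{N}\) term is the minimum, and for larger visit counts the \(c_1/N\) term takes over, so the ratio of the two parameters sets where the bonus switches from the slower to the faster decay.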
