Optimal Cooperative Multiplayer Learning Bandits With Noisy Rewards And No Communication
2023 · William Chang, Yuanhao Lu
Abstract
We consider a cooperative multiplayer bandit learning problem where the players are only allowed to agree on a strategy beforehand, but cannot communicate during the learning process. In this problem, each player simultaneously selects an action. Based on the actions selected by all players, the team of players receives a reward. The actions of all the players are commonly observed. However, each player receives a noisy version of the reward which cannot be shared with other players. Since players receive potentially different rewards, there is an asymmetry in the information used to select their actions. In this paper, we provide an algorithm based on upper and lower confidence bounds that the players can use to select their optimal actions despite the asymmetry in the reward information. We show that this algorithm can achieve logarithmic \(O(\frac{\log T}{\Delta_{\bm{a}}})\) (gap-dependent) regret as well as \(O(\sqrt{T\log T})\) (gap-independent) regret. This is asymptotica
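The abstract does not spell out the algorithm, so the following is only a generic sketch of how upper and lower confidence bounds can be used to rule out joint actions. It is not the authors' method. The function names, the Hoeffding-style radius, the `scale` constant, and the example numbers are all assumptions made for illustration. Each player would run something like this on its own noisy reward observations, which is why different players may end up with different surviving sets.

```python
import math
from itertools import product

def confidence_bounds(mean, count, t, scale=2.0):
    """Hoeffding-style interval around an empirical mean (hypothetical helper)."""
    if count == 0:
        # An unplayed joint action cannot be ruled out yet.
        return -math.inf, math.inf
    radius = math.sqrt(scale * math.log(max(t, 2)) / count)
    return mean - radius, mean + radius

def surviving_actions(means, counts, t):
    """Keep joint actions whose upper bound is at least the best lower bound."""
    bounds = {a: confidence_bounds(means[a], counts[a], t) for a in means}
    best_lcb = max(lcb for lcb, _ in bounds.values())
    return [a for a, (_, ucb) in bounds.items() if ucb >= best_lcb]

# Two players with two actions each give four joint actions (illustrative data).
joint_actions = list(product(range(2), range(2)))
means = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.1}
counts = {a: 200 for a in joint_actions}
print(surviving_actions(means, counts, t=800))
```

As the counts grow, the intervals shrink and fewer joint actions survive. Because each player's empirical means come from a different noisy reward stream, a coordination rule agreed on beforehand would still be needed to keep the players aligned; this sketch does not attempt that part.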