Learning To Coordinate Under Threshold Rewards: A Cooperative Multi-agent Bandit Framework
2025 · Michael Ledford, William Regli
Abstract
Cooperative multi-agent systems often face tasks that require coordinated actions under uncertainty. While multi-armed bandit (MAB) problems provide a powerful framework for decentralized learning, most prior work assumes individually attainable rewards. We address the challenging setting where rewards are threshold-activated: an arm yields a payoff only when a minimum number of agents pull it simultaneously, and this threshold is unknown in advance. Complicating matters further, some arms are decoys, which require coordination to activate but yield no reward, introducing the new challenge of wasted joint exploration. We introduce Threshold-Coop-UCB (T-Coop-UCB), a decentralized algorithm that enables agents to jointly learn activation thresholds and reward distributions, forming effective coalitions without centralized control. Empirical results show that T-Coop-UCB consistently outperforms baseline methods in cumulative reward, regret, and coordination metrics, achieving near-Oracle per
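The threshold-activated reward model described in the abstract can be illustrated with a minimal sketch. This is not T-Coop-UCB itself, only a toy environment; the class name `ThresholdBandit`, the Bernoulli payoffs, and the per-arm return value are assumptions made for illustration, since the abstract does not specify the reward distributions or how a payoff is shared among agents.

```python
import random
from collections import Counter


class ThresholdBandit:
    """Toy environment (hypothetical): an arm pays only if enough agents pull it together."""

    def __init__(self, thresholds, means, seed=0):
        # thresholds[k]: minimum number of simultaneous pulls needed to activate arm k
        # means[k]: Bernoulli payoff probability once activated (assumed); 0.0 marks a decoy
        self.thresholds = thresholds
        self.means = means
        self.rng = random.Random(seed)

    def step(self, pulls):
        """pulls[i] is the arm chosen by agent i; returns {arm: reward} for each pulled arm."""
        counts = Counter(pulls)
        rewards = {}
        for arm, n in counts.items():
            activated = n >= self.thresholds[arm]
            # A payoff is drawn only when the coalition on this arm meets its threshold.
            if activated and self.rng.random() < self.means[arm]:
                rewards[arm] = 1.0
            else:
                rewards[arm] = 0.0
        return rewards


# Arm 2 is a decoy: it activates with two agents but never pays.
env = ThresholdBandit(thresholds=[2, 3, 2], means=[1.0, 1.0, 0.0])
result = env.step([0, 0, 1, 2, 2])
print(result)
```

In this round, arm 0 meets its threshold and pays, arm 1 is pulled by too few agents, and the two agents on arm 2 spend a coordinated pull on a decoy. From the agents' side, an under-threshold pull and a decoy pull both return zero, which is what makes the thresholds and the decoys hard to tell apart during exploration.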
Related papers
- Online Learning For Cooperative Multi-player Multi-armed Bandits (2021)
- Optimal Cooperative Multiplayer Learning Bandits With Noisy Rewards And No Communication (2023)
- Near-optimal Collaborative Learning In Bandits (2022)
- Coordinated Exploration Via Intrinsic Rewards For Multi-agent Reinforcement Learning (2019)
- Learning Reward Functions For Cooperative Resilience In Multi-agent Systems (2026)
- Signal Instructed Coordination In Cooperative Multi-agent Reinforcement Learning (2019)
- Fully Decentralized Cooperative Multi-agent Reinforcement Learning: A Survey (2024)
- Multi-action Restless Bandits With Weakly Coupled Constraints: Simultaneous Learning And Control (2024)