SCoUT: Scalable Communication via Utility-guided Temporal Grouping in Multi-agent Reinforcement Learning

2026

arXiv:vora2026scout

Abstract

Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning *when* and *with whom* to communicate requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility-guided Temporal grouping), which addresses both challenges through temporal and agent abstraction within traditional MARL. During training, SCoUT resamples *soft* agent groups every \(K\) environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group-aware critic predicts a value for each agent group and maps these values to per-agent baselines through the soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy…
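To make the grouping machinery concrete, here is a minimal NumPy sketch of three pieces the abstract names: soft group assignments drawn with Gumbel-Softmax and held fixed for a macro-step of \(K\) environment steps, an affinity between agents derived from those assignments and normalized into a prior over recipients, and per-agent baselines obtained by mapping group values back through the same soft assignments. This is not the authors' implementation. The affinity form \(A = ZZ^\top\) with a zeroed diagonal, the row normalization into a recipient prior, the temperature `tau`, and every function name below are hypothetical choices made for illustration.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Draw one soft one-hot sample per row using the Gumbel-Softmax relaxation."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))          # standard Gumbel noise
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


def recipient_affinity(z):
    """Hypothetical affinity: A[i, j] = sum_g z[i, g] * z[j, g] (soft co-membership)."""
    a = z @ z.T
    np.fill_diagonal(a, 0.0)  # assumption: an agent never messages itself
    return a


def recipient_prior(z):
    """Hypothetical prior: normalize each agent's affinity row into a distribution."""
    a = recipient_affinity(z)
    return a / a.sum(axis=1, keepdims=True)


def per_agent_baselines(z, group_values):
    """Map one value per group to one baseline per agent via the soft assignments."""
    return z @ group_values  # shape (n_agents,)


def rollout_groups(assignment_logits, num_steps, k, tau=1.0, seed=0):
    """Resample soft groups once per macro-step of k environment steps."""
    rng = np.random.default_rng(seed)
    z = None
    groups_per_step = []
    for t in range(num_steps):
        if t % k == 0:  # start of a new macro-step: draw fresh assignments
            z = gumbel_softmax(assignment_logits, tau, rng)
        groups_per_step.append(z)  # same assignments reused within the macro-step
    return groups_per_step
```

For example, with 4 agents, 2 groups, and \(K = 3\), `rollout_groups(np.zeros((4, 2)), num_steps=6, k=3)` returns one assignment matrix shared by steps 0 to 2 and a fresh one for steps 3 to 5. Because each baseline is a convex combination of the group values, a critic in this sketch only has to output one value per group rather than one per agent, which is one way to read the abstract's claim about reduced critic complexity.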
