Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, the actions of all \(N\) agents jointly determine each agent's learning signal, so cross-agent noise grows with \(N\). In the policy gradient setting, per-agent gradient estimate variance scales as \(\Theta(N)\), yielding sample complexity \(\mathcal{O}(N/\epsilon)\). We observe that many domains, including cloud computing, transportation, and power systems, have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that utilizes these analytical models to provide each agent with a noise-free gradient signal, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from \(\Theta(N)\) to \(\mathcal{O}(1)\), preserves the equilibria of the cooperative game, and achieves agent-independent sample
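The \(\Theta(N)\) variance growth under a shared reward can be illustrated with a small simulation. The sketch below is not DG-PG and is not taken from the paper; it is a toy setting assumed purely for illustration: each of \(N\) agents has a Bernoulli policy with action probability 0.5, the shared reward is the number of agents that choose action 1, and agent 0 uses a score-function (REINFORCE) estimate with the constant baseline \(N/2\). The function names `per_agent_gradient_samples` and `variance` are hypothetical. In this toy model the variance of agent 0's estimate works out to \((N-1)/16\), so it grows linearly with the number of agents.

```python
import random


def per_agent_gradient_samples(n_agents, n_samples, seed=0):
    """Draw REINFORCE gradient samples for agent 0 in a toy shared-reward game.

    Assumptions (illustrative only): every agent acts with probability 0.5,
    and the shared reward is the count of agents that chose action 1.
    """
    rng = random.Random(seed)
    baseline = n_agents / 2  # expected shared reward under p = 0.5
    samples = []
    for _ in range(n_samples):
        # One joint action: bit j of `bits` is agent j's binary action.
        bits = rng.getrandbits(n_agents)
        shared_reward = bin(bits).count("1")
        a0 = bits & 1
        # Score function of a Bernoulli policy with logit theta:
        # d/dtheta log pi(a) = a - p, here with p = 0.5.
        score = a0 - 0.5
        samples.append((shared_reward - baseline) * score)
    return samples


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        est = variance(per_agent_gradient_samples(n, 50_000))
        print(f"N={n:4d}  empirical variance={est:.3f}  (N-1)/16={(n - 1) / 16:.3f}")
```

Agent 0's own contribution to the estimate is the same for every \(N\); the growth comes entirely from the other \(N-1\) agents' actions entering the shared reward, which is the cross-agent noise described above.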
