Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, the actions of all \(N\) agents jointly determine each agent's learning signal, so cross-agent noise grows with \(N\). In the policy gradient setting, per-agent gradient estimate variance scales as \(\Theta(N)\), yielding sample complexity \(\mathcal{O}(N/\epsilon)\). We observe that many domains, including cloud computing, transportation, and power systems, have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that utilizes these analytical models to provide each agent with a noise-free gradient signal, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from \(\Theta(N)\) to \(\mathcal{O}(1)\), preserves the equilibria of the cooperative game, and achieves agent-independent sample
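
The \(\Theta(N)\) variance scaling of shared-reward policy gradients can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: each agent uses a unit-variance Gaussian policy with mean 0, the shared reward is simply the sum of all actions, and the function name `per_agent_gradient_variance` is hypothetical. It does not implement DG-PG; it only shows the cross-agent noise the abstract describes. In this toy model the score-function estimate for one agent has variance \(N + 1\) analytically.

```python
import random
import statistics


def per_agent_gradient_variance(n_agents, n_samples=20000, seed=0):
    """Empirical variance of agent 0's score-function gradient estimate.

    Toy assumptions (not from the paper): actions a_j ~ N(0, 1) and a
    shared reward R = sum_j a_j.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        actions = [rng.gauss(0.0, 1.0) for _ in range(n_agents)]
        shared_reward = sum(actions)  # every agent receives the same reward
        # Score of agent 0's policy: d/d mu_0 log N(a_0; mu_0, 1) at mu_0 = 0
        score = actions[0]
        estimates.append(shared_reward * score)
    return statistics.variance(estimates)


if __name__ == "__main__":
    for n in (1, 4, 16):
        # Analytically the variance is n + 1 in this toy model.
        print(n, round(per_agent_gradient_variance(n), 2))
```

The other agents' actions enter agent 0's estimate only as noise, so the variance grows linearly with the number of agents even though agent 0's own contribution to the reward is fixed.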

Authors

(none)

Tags

  • Multi-Agent
  • Policy Gradient

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: yang2026descent

Related papers