Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
The paper addresses the scalability bottleneck in cooperative multi-agent systems for domains such as cloud computing and transportation. The contribution is incremental: it combines existing policy gradient methods with domain analytical models.
The paper tackles the scaling of cooperative multi-agent reinforcement learning, where cross-agent noise grows with the number of agents. It proposes Descent-Guided Policy Gradient (DG-PG), which reduces per-agent gradient variance from Θ(N) to O(1) and achieves agent-independent sample complexity O(1/ε). On a cloud scheduling task, DG-PG converges within 10 episodes at every tested scale, from 5 to 200 agents.
Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\epsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\epsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
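The variance claim can be illustrated with a toy simulation. The sketch below is not the DG-PG construction and does not use an analytical model; it is a hypothetical example in which each agent has a unit-variance Gaussian policy centred at zero and the shared reward is the sum of all actions. Under those assumptions, the score-function estimator for one agent has variance $N + 1$ when it uses the shared reward, and variance $2$ when it uses only that agent's own term. The function name `gradient_variances` and all parameters are illustrative.

```python
import numpy as np


def gradient_variances(n_agents, n_samples=50_000, seed=0):
    """Empirical variance of two unbiased gradient estimators for agent 0.

    Toy assumptions (not from the paper): actions a_j = theta_j + z_j with
    theta = 0 and z_j ~ N(0, 1), and a shared reward R = sum_j a_j, so the
    true gradient dE[R]/dtheta_0 equals 1.
    """
    rng = np.random.default_rng(seed)
    # With unit-variance Gaussian policies, the score of agent j is its noise z_j.
    z = rng.standard_normal((n_samples, n_agents))

    # Score-function estimator driven by the shared reward: every other
    # agent's action enters as noise. Analytic variance in this toy: N + 1.
    shared_reward = z.sum(axis=1)
    shared_estimate = shared_reward * z[:, 0]

    # Estimator that depends only on agent 0's own action.
    # Analytic variance in this toy: 2, independent of N.
    local_estimate = z[:, 0] * z[:, 0]

    return shared_estimate.var(), local_estimate.var()


if __name__ == "__main__":
    for n in (5, 50, 200):
        shared_var, local_var = gradient_variances(n)
        print(f"N={n:4d}  shared-reward var={shared_var:8.2f}  local var={local_var:5.2f}")
```

Both estimators have the same expectation in this toy setting; only the one that avoids the other agents' actions keeps a variance that does not grow with $N$, which is the property the abstract attributes to the guidance gradients.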