The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

arXiv:2606.02646

AI Analysis

This work provides a theoretical framework and practical tool for determining the optimal team size in multi-agent LLM systems, addressing a key bottleneck in inference-time scaling for practitioners.

The authors derive a two-parameter scaling law for multi-agent LLM systems that predicts effective team size, showing that beyond a small number of agents, adding more yields diminishing returns. Across 44 experimental conditions, the law fits with R² > 0.99. On free-form math, dense peer debate collapses answer diversity to a hard ceiling, and within homogeneous teams the gains attributed to debate come largely from re-evaluation rather than peer content.

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$, where the regime exponent $\beta$ classifies any configuration into one of three asymptotic regimes: hard-ceiling at $1/c$ ($\beta = 0$), sublinear at $N^\beta/c$ ($0 < \beta < 1$), or linear ($\beta \ge 1$). A mean-field theorem predicts that peer count $k$ and rounds $\tau$ during agent debate enter the dynamics only through their product $k\tau$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, \beta)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime; communication-mode interventions do not.
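
As a rough illustration of how the law could be used for a small pilot, the sketch below evaluates $N_\text{eff} = N \cdot R(N)$ and recovers $(c, \beta)$ from a few $(N, R)$ measurements. The fitting procedure is an assumption: it uses a log-linearization, $\log\big((1/R - 1)/(N-1)\big) = \log c - \beta \log N$, followed by ordinary least squares, which is not necessarily how the authors fit the law. The function names and the pilot values are hypothetical, and the pilot data is synthetic (generated from the law itself), not taken from the paper.

```python
import math


def ratio(n, c, beta):
    """R(N) = N_eff / N = 1 / (1 + c (N - 1) N^(-beta))."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def n_eff(n, c, beta):
    """Effective team size N_eff = N * R(N)."""
    return n * ratio(n, c, beta)


def fit_law(ns, rs):
    """Estimate (c, beta) by least squares on the linearized form
    log((1/R - 1) / (N - 1)) = log c - beta * log N.
    Assumes every N >= 2 and every R strictly between 0 and 1,
    and at least two distinct values of N."""
    xs = [math.log(n) for n in ns]
    ys = [math.log((1.0 / r - 1.0) / (n - 1)) for n, r in zip(ns, rs)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (c, beta)


# Synthetic pilot at N <= 5 (hypothetical values, not from the paper):
# a hard-ceiling configuration with c = 0.5 and beta = 0.
pilot_ns = [2, 3, 4, 5]
true_c, true_beta = 0.5, 0.0
pilot_rs = [ratio(n, true_c, true_beta) for n in pilot_ns]

c_hat, beta_hat = fit_law(pilot_ns, pilot_rs)
print(f"fitted c = {c_hat:.3f}, beta = {beta_hat:.3f}")
print(f"extrapolated N_eff at N = 30: {n_eff(30, c_hat, beta_hat):.3f}")
print(f"hard ceiling 1/c: {1.0 / c_hat:.3f}")
```

With noisy real measurements the linearized fit weights small-$N$ points differently than a direct nonlinear fit would, so a nonlinear least-squares routine may be preferable in practice.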
