MAGTNEMay 26

Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure

arXiv:2605.2644821.4
AI Analysis

For AI alignment researchers, it reveals that adversarial constitutional co-evolution can produce interpretable red-team artifacts, but requires careful fitness design and evaluation budget to avoid mode regression.

The paper demonstrates that adversarial co-evolution of LLM constitutions is feasible in a Public Goods Game, achieving a near-parity equilibrium at S≈0.78 across multipliers, but only under coupled fitness and sufficient evaluation budget (K=5); otherwise, adversarial pressure fails to emerge.

Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial-specialist objectives. We study adversarial constitutional co-evolution (Blue cooperators vs. Red free-riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid-world. Three findings: (1) in the PGG, both factions converge to a near-parity equilibrium at S approximately 0.78, robust across tested multipliers m in {1.2, 1.5, 2.0, 3.0}; (2) in independently scored environments, per-faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score-advantage fitness target S_own - S_opp restores it; (3) under pure-adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co-evolution of natural-language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red-team artifacts for testing future cooperative designs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes