Hershraj Niranjani

21.4MAMay 26

Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure

Ujwal Kumar, Arth Singh, Hershraj Niranjani et al.

Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial-specialist objectives. We study adversarial constitutional co-evolution (Blue cooperators vs. Red free-riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid-world. Three findings: (1) in the PGG, both factions converge to a near-parity equilibrium at S approximately 0.78, robust across tested multipliers m in {1.2, 1.5, 2.0, 3.0}; (2) in independently scored environments, per-faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score-advantage fitness target S_own - S_opp restores it; (3) under pure-adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co-evolution of natural-language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red-team artifacts for testing future cooperative designs.

40.1MAMay 9

Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

Hershraj Niranjani, Ujwal Kumar, Phan Xuan Tan

Multi-agent AI systems need behavioral constitutions, but it is unresolved whether such rules should emerge internally through agent self-governance or be discovered externally through optimization. We present the first controlled comparison of internal deliberation and external evolution across three social environments: a coordination grid-world, an iterated public goods game, and a bilateral trading market. Across 180 simulation runs, evolution significantly outperforms deliberation in collective-action settings (p < 0.01), while neither method improves outcomes in bilateral trading. A multiplier ablation reveals that evolution's advantage inverts when incentives shift: at pool multiplier (m = 0.75) the evolved constitution forces value-destroying cooperation and becomes the worst-performing method. Notably, no deliberation run across thirty trials ever proposed punishment -- the canonical cooperation-sustaining mechanism evolution reliably discovers -- suggesting external optimization wins on peaks while internal self-governance trades peaks for structural responsiveness.

Hershraj Niranjani

2 Papers