LG · AI · May 13

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

arXiv:2605.12991 · 24.3
Predicted impact: top 30% in LG · last 90 days · Originality: Highly original
AI Analysis

For developers of LLM-based multi-agent systems, the paper identifies the mechanism behind a critical failure mode and demonstrates that prompt-level defenses are insufficient, so mitigation requires structural changes at the pipeline level.

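The paper shows that multi-agent sycophancy in LLM pipelines is not primarily caused by RLHF: pretrained base models average higher yield (the rate at which a model flips from a correct to an incorrect answer under peer disagreement) than their Instruct variants. Activation patching localizes the corruption to a narrow mid-layer window in which attention carries the causal weight, and a single correctly-arguing dissenter reduces yield by 54-73 percentage points.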

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should therefore target the mechanism through structured dissent at the pipeline level, rather than rely on prompt-level defenses.
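
The two quantities the abstract reports, yield and the share of the P(correct) gap restored by patching, can be illustrated with a minimal sketch. It rests on assumptions not confirmed by the text above: that yield is computed only over items the model answered correctly without pressure, and that the restored share uses the usual normalization (patched minus pressured, divided by clean minus pressured). The names `Trial`, `yield_rate`, and `recovered_fraction` are hypothetical, and all numbers are made up for illustration; none come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    clean_correct: bool      # answer was correct with no peer pressure
    pressured_correct: bool  # answer was correct under simulated peer disagreement


def yield_rate(trials):
    """Fraction of initially correct answers that flip to incorrect under pressure.

    Assumption: trials that were already wrong without pressure are excluded.
    """
    eligible = [t for t in trials if t.clean_correct]
    if not eligible:
        raise ValueError("no initially correct trials to measure yield on")
    flipped = sum(1 for t in eligible if not t.pressured_correct)
    return flipped / len(eligible)


def recovered_fraction(p_clean, p_pressured, p_patched):
    """Share of the clean-to-pressured P(correct) gap restored by patching.

    Assumption: standard normalization, 0 means no recovery, 1 means full recovery.
    """
    gap = p_clean - p_pressured
    if gap == 0:
        raise ValueError("clean and pressured P(correct) are equal; gap is undefined")
    return (p_patched - p_pressured) / gap


# Illustrative data only (not from the paper).
trials = [
    Trial(True, False),
    Trial(True, True),
    Trial(True, False),
    Trial(False, False),  # excluded: already wrong without pressure
    Trial(True, True),
]
print(yield_rate(trials))
print(recovered_fraction(p_clean=0.8, p_pressured=0.3, p_patched=0.7))
```

Under these definitions, a "percentage-point" reduction in yield is the plain difference between two `yield_rate` values multiplied by 100, not a relative change.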

