AI | May 10

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang

arXiv:2605.09314 | 88.01 citations | Has Code

AI Analysis

This work gives LLM developers and safety researchers a mechanistic account of a key AI safety vulnerability, persuasion, and shows that it runs through a narrow, monitorable circuit.

The paper identifies a compact causal mechanism for persuasion-induced factual errors in LLMs: persuasion redirects the attention of a small set of mid-layer attention heads, causing a discrete jump from the correct-answer vertex to the persuasion-target vertex. It further shows that modifying a rank-one evidence-routing feature steers the model's choice, and that removing the feature blocks persuasion, across multiple open-source models and realistic poisoning scenarios.

Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
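The two interventions named in the abstract, steering by modifying a rank-one feature and blocking by removing it, can be illustrated with a toy sketch. This is not the authors' code: the three-dimensional hidden state, the routing direction `u`, the option keys, and every function name below are hypothetical, chosen only to show what writing to or projecting out a single direction does to a head that picks whichever option token it attends to most.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def ablate(h, u):
    """Remove the component of h along the unit direction u (rank-one projection)."""
    c = dot(h, u)
    return [x - c * ui for x, ui in zip(h, u)]

def steer(h, u, alpha):
    """Overwrite the component of h along the unit direction u with alpha."""
    c = dot(h, u)
    return [x + (alpha - c) * ui for x, ui in zip(h, u)]

def copy_head_choice(h, option_keys):
    """Toy decision head: attend over option tokens, return the most-attended index."""
    attn = softmax([dot(h, k) for k in option_keys])
    return max(range(len(attn)), key=attn.__getitem__)

# Hypothetical toy setup (assumption, not taken from the paper).
u = [0.0, 0.0, 1.0]                # assumed evidence-routing direction (unit norm)
option_keys = [[1.0, 0.0, 0.0],    # index 0: correct answer option
               [0.0, 0.0, 1.0]]    # index 1: persuasion-target option

h_clean = [1.0, 0.0, 0.0]          # no routing feature present
h_persuaded = [1.0, 0.0, 2.0]      # routing feature written by the persuasive input

choice_clean = copy_head_choice(h_clean, option_keys)
choice_persuaded = copy_head_choice(h_persuaded, option_keys)
choice_blocked = copy_head_choice(ablate(h_persuaded, u), option_keys)
choice_steered = copy_head_choice(steer(h_clean, u, 2.0), option_keys)
```

In this toy, the clean state selects option 0, the persuaded state selects option 1, ablating `u` from the persuaded state restores option 0, and writing `u` into the clean state flips it to option 1. In a real model the direction would have to be estimated from activations and applied through hooks on the relevant layers, which this sketch does not attempt.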
