AI CLMay 4

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Lui, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

arXiv:2605.0275170.8

Predicted impact top 49% in AI · last 90 daysOriginality Incremental advance

AI Analysis

For developers deploying multiple language models in multi-agent settings, this work identifies and mitigates the risk of misalignment spreading between models.

Language models become more anti-social after multi-turn interactions in social dilemma games, a phenomenon called misalignment contagion. The authors propose steering with implicit traits, which intermittently injects system prompts with statements reinforcing initial traits, and show it is more effective than system prompt repetition at maintaining pro-social behavior.

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.

View on arXiv PDF

Similar