CL · LG · Jul 29, 2025

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

arXiv:2507.21509v3 · 207 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses unintended personality deviations in AI assistants, offering users and developers automated tools for oversight, though it is incremental in that it applies existing vector-based methods to personality traits.

The paper tackles the problem of monitoring and controlling undesirable character traits, such as evil, sycophancy, and hallucination, in large language models by identifying persona vectors in activation space. It finds that these vectors can predict personality shifts during training, and that interventions along them reduce such shifts.

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
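
To make the idea of a direction in activation space concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means extraction (mean activation on trait-exhibiting responses minus mean activation on baseline responses), which the abstract does not spell out. The function names, the 8-dimensional toy activations, and the steering coefficient are all hypothetical and stand in for real model activations.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(trait_acts, baseline_acts):
    """Assumed extraction: unit-normalized difference of mean activations."""
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means coincide; no direction to extract")
    return [d / norm for d in diff]


def projection(act, vec):
    """Monitoring signal: scalar projection of an activation onto the vector."""
    return sum(a * v for a, v in zip(act, vec))


def steer(act, vec, alpha):
    """Shift an activation along the vector by alpha (negative alpha suppresses)."""
    return [a + alpha * v for a, v in zip(act, vec)]


# Synthetic stand-in for hidden activations: the "trait" shifts dimension 0.
rng = random.Random(0)
DIM = 8
TRUE_DIR = [1.0] + [0.0] * (DIM - 1)


def sample(shift):
    return [rng.gauss(0.0, 0.1) + shift * t for t in TRUE_DIR]


baseline = [sample(0.0) for _ in range(50)]
trait = [sample(2.0) for _ in range(50)]

vec = persona_vector(trait, baseline)
score_trait = projection(trait[0], vec)
score_base = projection(baseline[0], vec)

# Post-hoc style intervention: subtract the component along the vector.
steered = steer(trait[0], vec, -score_trait)
```

In this toy setup the projection separates trait from baseline samples, and subtracting the projected component drives the monitoring score to zero. With a real model the activations would come from a chosen layer, and the layer choice and coefficient would need to be tuned empirically.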

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
