AI · LG · Feb 5

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

arXiv:2602.05532v1 · 2 citations · h-index: 7
Originality: Highly original
AI Analysis

The paper addresses AI safety auditing for researchers and developers. It offers a novel approach to revealing hidden misalignment, though the method is incremental in that it builds on existing fine-tuning techniques.

The paper tackles the challenge of detecting misalignment in large language models, which can learn to conceal misbehavior. It introduces Split Personality Training (SPT), which fine-tunes an honest persona to review the model's responses, and reports 96% accuracy on a benchmark where standard methods fail.

Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second "honest persona" into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.
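The two-pass flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `LoRALinear`, `ToyModel`, `audit`, and the `TRIGGER` string are all hypothetical names, and the canned strings returned by `ToyModel.generate` are placeholders. The sketch shows only two things: a low-rank update that leaves the base layer's output unchanged while disabled, and an audit loop that answers with the adapter off, then re-reads the exchange with the adapter on and a trigger appended. It does not model access to latent states.

```python
from dataclasses import dataclass, field

# Hypothetical trigger string; the source does not state the actual one.
TRIGGER = "<honest-review>"


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


@dataclass
class LoRALinear:
    """Toy linear layer y = W x with an optional low-rank update B (A x)."""
    W: list  # d_out x d_in, frozen base weights
    A: list  # r x d_in, adapter down-projection
    B: list  # d_out x r, adapter up-projection
    adapter_enabled: bool = False

    def forward(self, x):
        y = matvec(self.W, x)
        if self.adapter_enabled:
            # The adapter contributes only while it is enabled.
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + di for yi, di in zip(y, delta)]
        return y


@dataclass
class ToyModel:
    """Stand-in for a language model; outputs are placeholders."""
    adapter_enabled: bool = False
    calls: list = field(default_factory=list)

    def generate(self, text):
        self.calls.append((self.adapter_enabled, text))
        return "[review]" if self.adapter_enabled else "[answer]"


def audit(model, prompt):
    """Pass 1: adapter off. Pass 2: adapter on, trigger appended."""
    model.adapter_enabled = False
    response = model.generate(prompt)
    model.adapter_enabled = True
    try:
        review = model.generate(prompt + response + TRIGGER)
    finally:
        model.adapter_enabled = False  # restore normal operation
    return response, review
```

With the adapter disabled, `LoRALinear.forward` computes exactly `W x`, which is the sense in which the extra parameters "remain inactive during normal operation" in this toy setting.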

