AI LG | Feb 5

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow

arXiv:2602.05532v17.52 citations | h-index: 7

Originality: Highly original

AI Analysis

The paper addresses AI safety auditing for researchers and developers. It offers a new approach to revealing hidden misalignment, though the method itself builds on existing fine-tuning techniques.

The paper tackles the challenge of detecting misalignment in large language models, which can conceal misbehavior. It introduces Split Personality Training (SPT), which fine-tunes an honest persona to review the main model's responses, and reports 96% overall accuracy on a benchmark where Anthropic reports near 0%.

Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second "honest persona" into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.
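
The adapter-toggling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single linear layer in plain Python, and the names `LoRALinear`, `audit_prompt`, and the `TRIGGER` string are hypothetical. It only shows that a LoRA update left inactive does not change the base output, and that switching it on changes how the same input is processed. Reuse of the main model's latent states is not modeled here.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRALinear:
    """Toy linear layer with an optional low-rank (LoRA) update."""
    weight: Matrix                # frozen base weight, shape (out, in)
    lora_a: Matrix                # down-projection, shape (rank, in)
    lora_b: Matrix                # up-projection, shape (out, rank)
    scale: float = 1.0
    adapter_active: bool = False  # inactive during normal operation

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.adapter_active:
            return base  # main persona: base weights only
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


# Hypothetical trigger string; the source does not give the actual one.
TRIGGER = "<honest-review>"


def audit_prompt(prompt: str, response: str) -> str:
    """Append the trigger after the main model's response (assumed layout)."""
    return f"{prompt}\n{response}\n{TRIGGER}"


layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [-0.5]],
)
x = [2.0, 3.0]

normal_out = layer.forward(x)   # adapter off: identical to the base model
layer.adapter_active = True     # switch on the "honest persona" parameters
review_out = layer.forward(x)   # adapter on: base output plus LoRA delta
```

In a real setup the same toggle would apply to every adapted layer of the model, and the review pass would run on the text produced by `audit_prompt`.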

