Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
This work addresses the critical problem of ensuring AI safety by exposing vulnerabilities in current alignment auditing methods. The contribution is incremental but important for researchers and practitioners in AI alignment.
The authors stress-tested alignment audits by developing an automatic red-team pipeline that generates deceptive prompts, finding that both black-box and white-box methods were deceived into confident, incorrect guesses, with evidence of activation-based strategic deception.
Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.