LGMay 2, 2025

Evaluating Frontier Models for Stealth and Situational Awareness

Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah

arXiv:2505.01420v428.127 citationsh-index: 14

Originality Incremental advance

AI Analysis

This addresses AI safety concerns for developers by providing tools to rule out harmful scheming before deployment, though it is incremental as it builds on prior work on model evaluation.

The paper tackles the problem of detecting scheming behavior in frontier AI models by proposing a suite of evaluations for stealth and situational awareness, finding that current models show no concerning levels of these capabilities.

Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with its developer's intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.

View on arXiv PDF

Similar