AI

Sep 10, 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ryan Lagasse, Vasu Sharma, Ashwinee Panda

arXiv:2509.13333v2

15.67 citations

h-index: 3

Originality: Incremental advance

AI Analysis

This work contributes to AI safety by making it possible to forecast deceptive behavior in larger models and by guiding scale-aware evaluation strategies. It is incremental in that it extends a finding previously shown for a single model to a range of model sizes.

The study investigates how evaluation awareness, the ability of large language models to distinguish testing from deployment contexts, scales with model size across 15 models from 0.27B to 70B parameters. It finds a clear power-law relationship: evaluation awareness increases predictably with size.

Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behavior known as evaluation awareness. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across 15 models scaling from 0.27B to 70B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
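As a rough illustration of what fitting and extrapolating a power-law scaling relationship can look like, the sketch below fits score ≈ a · N^b by ordinary least squares in log-log space. It is not the authors' implementation: the function name `fit_power_law`, the choice of fitting method, and all numbers are assumptions, and the data points are synthetic placeholders generated from a known curve, not results from the paper.

```python
import math


def fit_power_law(sizes, scores):
    """Fit score ~= a * size**b via least squares on (log size, log score).

    Hypothetical helper for illustration; inputs must be strictly positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in scores]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the log-log regression line is the power-law exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # Intercept of the regression line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic placeholder data (NOT measurements from the paper):
# model sizes in billions of parameters, scores drawn from a known power law.
sizes_b = [0.27, 1.0, 7.0, 70.0]
true_a, true_b = 0.5, 0.1
scores = [true_a * n ** true_b for n in sizes_b]

a, b = fit_power_law(sizes_b, scores)

# Extrapolate to a hypothetical larger model (140B parameters).
predicted = a * 140.0 ** b
```

With real probe scores the points would not lie exactly on a line in log-log space, so in practice one would also inspect the residuals before trusting any extrapolation beyond the largest measured model.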
