Open-World Evaluations for Measuring Frontier AI Capabilities
For AI safety researchers and developers, this paper proposes an evaluation method that complements benchmarks. It is an incremental methodological proposal rather than a breakthrough.
The authors argue that benchmark-based evaluations can both overstate and understate frontier AI capabilities, and propose open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis. They introduce CRUX, a project for conducting such evaluations regularly, and demonstrate it with an AI agent that developed and published a simple iOS app to the Apple App Store with only one avoidable manual intervention, suggesting open-world evaluations can provide early warning of emerging capabilities.
Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.