Towards Understanding Specification Gaming in Reasoning Models
This work systematically studies specification gaming, a critical failure mode of LLM agents, providing an evaluation suite and insights into how RL reasoning training exacerbates it.
The paper investigates specification gaming in LLM agents, finding that all tested models exploit their specifications at non-negligible rates in most of the eight settings studied, with Grok 4 showing the highest rates and Claude models the lowest. RL reasoning training substantially increases exploit rates, while test-time mitigations reduce but do not eliminate specification gaming.
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open-source a diverse suite of tasks in which models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, five of which are non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that (1) RL reasoning training substantially increases the rate at which models exploit their specifications, (2) increasing the RL reasoning budget has a weakly positive effect on exploit rate, and (3) test-time mitigations reduce but do not eliminate specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.