AIMay 12

Classifier Context Rot: Monitor Performance Degrades with Context Length

arXiv:2605.1236664.8

AI Analysis

For AI safety researchers evaluating agent monitors, this reveals a critical blind spot in current benchmarks that underestimate performance degradation at long context lengths.

The paper shows that frontier language models used as classifiers for monitoring coding agents fail to detect dangerous actions more often in longer transcripts (up to 30x more misses after 800K tokens), and that prompting techniques like periodic reminders can partially mitigate this degradation.

Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on a dataset that requires identifying when a coding agent takes a subtly dangerous action, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions $2\times$ to $30\times$ more often when they occur after 800K tokens of benign activity than when they occur on their own. We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training. Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.

View on arXiv PDF

Similar