AI LG May 29, 2025

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

arXiv:2505.23575v2, 23 citations, h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses AI safety in deployment scenarios by proposing a monitoring method for detecting harmful actions. The advance is incremental, since it builds on existing CoT and monitoring concepts.

The paper investigates Chain-of-Thought (CoT) monitoring, in which a weaker trusted model oversees the reasoning steps of a more powerful untrusted model, as a way to keep AI models from taking harmful actions unnoticed. CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage, but misleading rationalizations in the reasoning trace can deceive the monitor in more obvious sabotage cases. To address this, the authors introduce a hybrid protocol that scores reasoning and final outputs independently and combines the two scores with a weighted average. The hybrid monitor consistently outperforms both CoT and action-only monitoring, with detection rates over four times higher than action-only monitoring for subtle deception.

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. However, CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores both reasoning and final outputs and combines them using a weighted average. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
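
The hybrid protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract states only that the reasoning and the final output are scored independently and combined with a weighted average. The names `MonitorScores`, `hybrid_score`, and `is_flagged`, the [0, 1] score scale, the default weight, and the flagging threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorScores:
    """Suspicion scores from two independent monitor passes (assumed scale: 0 to 1)."""
    cot: float     # score from the monitor that reads the reasoning trace
    action: float  # score from the monitor that reads only the final output


def hybrid_score(scores: MonitorScores, cot_weight: float = 0.5) -> float:
    """Combine the two scores with a weighted average.

    cot_weight is a hypothetical tuning parameter; the abstract does not
    state the weight the authors use.
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be in [0, 1]")
    return cot_weight * scores.cot + (1.0 - cot_weight) * scores.action


def is_flagged(scores: MonitorScores, cot_weight: float = 0.5,
               threshold: float = 0.5) -> bool:
    """Flag a transcript when the combined score reaches the (assumed) threshold."""
    return hybrid_score(scores, cot_weight) >= threshold


# A persuasive rationalization lowers the CoT score, but the suspicious
# final output still raises the combined score.
rationalized = MonitorScores(cot=0.2, action=0.9)
print(is_flagged(rationalized))
```

Scoring the two channels separately means a misleading reasoning trace can lower only one term of the average, which is one plausible reading of why the hybrid monitor is more robust than CoT monitoring alone.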
