AI, CR | Nov 28, 2025

Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?

arXiv:2512.00218v2 | 4 citations | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of ensuring AI safety through chain-of-thought monitoring, but its results are incremental: commonly used training incentives show only limited effects on monitorability.

The study investigated how different training incentives affect the monitorability of chain-of-thought reasoning in AI systems, finding that adversarial optimization degrades monitor performance while direct optimization does not reliably improve it.

AI systems that output their reasoning in natural language offer an opportunity for safety: we can monitor their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different training incentives, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
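
To make the abstract's setup concrete, here is a minimal sketch of the two ideas it names: scoring monitorability as how often a monitor recovers a latent variable from the reasoning, and adding a training incentive that either penalises or rewards monitor success. The function names, the integer-label encoding of the latent variable, and the additive reward form with a `weight` coefficient are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Sequence


def monitor_accuracy(predicted: Sequence[int], latent: Sequence[int]) -> float:
    """Fraction of examples where the monitor's guess matches the latent variable.

    Hypothetical metric: labels are assumed to be integers, one per example.
    """
    if len(predicted) != len(latent):
        raise ValueError("predicted and latent must have the same length")
    if len(latent) == 0:
        raise ValueError("need at least one example")
    matches = sum(p == z for p, z in zip(predicted, latent))
    return matches / len(latent)


def shaped_reward(task_reward: float, monitor_correct: bool,
                  weight: float, mode: str) -> float:
    """Combine the task reward with a monitor-based incentive (assumed additive).

    mode="adversarial" penalises episodes where the monitor was correct,
    mode="direct" rewards them, and mode="none" leaves the task reward unchanged.
    """
    signs = {"adversarial": -1.0, "direct": 1.0, "none": 0.0}
    if mode not in signs:
        raise ValueError(f"unknown mode: {mode}")
    return task_reward + signs[mode] * weight * float(monitor_correct)


if __name__ == "__main__":
    # Toy data: the monitor recovers the latent variable on 2 of 4 examples.
    print(monitor_accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
    print(shaped_reward(1.0, True, 0.5, "adversarial"))
    print(shaped_reward(1.0, True, 0.5, "direct"))
```

Under this sketch, comparing `monitor_accuracy` across training runs only isolates the incentive's effect if the runs are matched on task accuracy, which is the control the abstract describes.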
