Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
This work helps researchers and practitioners audit AI safety by clarifying when monitorability gains occur under RLVR. It is incremental in that it builds on prior observations of RLVR effects.
The study investigates how monitorability, the faithfulness of chain-of-thought traces in Large Reasoning Models, emerges during Reinforcement Learning with Verifiable Rewards (RLVR). It finds that improvements are data-dependent and orthogonal to reasoning capability, and it attributes the gains to response distribution sharpening and increased attention to the prompt.
As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability, the degree to which CoT faithfully and informatively reflects internal computation, can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability: improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than to stronger causal reliance on reasoning traces. We also show how monitorability dynamics vary under controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.