LG | Oct 31, 2025

Reasoning Models Sometimes Output Illegible Chains of Thought

arXiv:2510.27338v1 | 7 citations
Originality: Incremental advance
AI Analysis

Opaque reasoning poses a risk for AI safety monitoring, because it could undermine efforts to detect malicious behavior in models.

The study found that language models trained with outcome-based reinforcement learning to reason using chain-of-thought often produce illegible reasoning steps while still generating correct final answers. Accuracy drops by 53% when the models are forced to use only the legible portions of their reasoning.

Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling, suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
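
The "legible portions only" comparison in the abstract can be pictured with a small sketch. This is not the authors' code: the Chunk type, the legibility score, the 0.5 threshold, and the filtering rule are all hypothetical stand-ins, and the abstract does not say whether the 53% figure is an absolute or a relative drop, so the sketch reports both.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """One segment of a chain of thought (hypothetical structure)."""
    text: str
    legibility: float  # assumed grader score in [0, 1]; higher is more readable


def keep_legible(chunks: List[Chunk], threshold: float = 0.5) -> List[Chunk]:
    """Drop every chunk scored below the (assumed) legibility threshold."""
    return [c for c in chunks if c.legibility >= threshold]


def accuracy(correct_flags: List[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not correct_flags:
        raise ValueError("need at least one graded answer")
    return sum(correct_flags) / len(correct_flags)


def accuracy_drop(full: float, legible_only: float) -> Tuple[float, float]:
    """Return (absolute drop, relative drop) between the two conditions."""
    absolute = full - legible_only
    relative = absolute / full if full > 0 else 0.0
    return absolute, relative


if __name__ == "__main__":
    # Made-up chain of thought: the middle chunk is scored as illegible.
    cot = [
        Chunk("Let x be the unknown side.", 0.9),
        Chunk("sq-rt ~~ delta thus thus 7 7", 0.2),
        Chunk("So x = 7.", 0.8),
    ]
    print([c.text for c in keep_legible(cot)])

    # Made-up accuracies for the full-CoT and legible-only conditions.
    print(accuracy_drop(full=0.8, legible_only=0.4))
```

Which of the two numbers a reported drop refers to matters for interpretation: with these made-up accuracies the absolute drop is 40 points while the relative drop is 50%.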
