When Chain-of-Thought Fails, the Solution Hides in the Hidden States

Houman Mehrafarin, Amit Parekh, Ioannis Konstas

arXiv:2604.23351 · 91.6 · h-index: 1

Predicted impact: top 26% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For researchers studying mechanistic interpretability of reasoning in LLMs, this work provides causal evidence that CoT tokens encode recoverable information, offering insights into how reasoning is represented and where it fails.

The paper shows that chain-of-thought tokens encode task-relevant information that can be recovered via activation patching: on GSM8K, generating after patching achieves higher accuracy than direct-answer prompting and even the original CoT trace. Correct runs contain more of this information than incorrect ones, and it is concentrated in mid-to-late layers and appears earlier in the trace.

Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace. Moreover, language tokens such as verbs and entities carry task-solving information that, when patched, steers generation toward correct reasoning, whereas mathematical tokens encode answer-proximal content whose patching rarely succeeds. Patched outputs are often shorter and yet exceed the accuracy of a full CoT trace, suggesting complete reasoning chains are not always necessary. Together, these findings demonstrate that CoT encodes recoverable, token-level problem-solving information, offering new insight into how reasoning is represented and where it breaks down.
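
The patching procedure described in the abstract (cache hidden states from one run, overwrite a hidden state in another run, measure the change in output) can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed`, `layer_step`, `run`, and `patch_effect` are hypothetical names, the "model" is a hand-written causal averaging network rather than an LLM, the token lists stand in for a CoT run and a direct-answer run, and the scalar readout stands in for final-answer accuracy.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Cache = Dict[Tuple[int, int], Vector]  # (layer, position) -> hidden state


def embed(token: int, dim: int = 4) -> Vector:
    """Hypothetical deterministic embedding, a toy stand-in for a learned table."""
    return [((token * (i + 3)) % 7) / 7.0 for i in range(dim)]


def layer_step(states: List[Vector], layer: int) -> List[Vector]:
    """Toy causal layer: each position adds a scaled mean of itself and earlier positions."""
    scale = 0.5 / (layer + 1)
    out = []
    for t in range(len(states)):
        prefix = states[: t + 1]
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([h + scale * m for h, m in zip(states[t], mean)])
    return out


def run(tokens: List[int], n_layers: int = 3,
        patch: Optional[Cache] = None) -> Tuple[float, Cache]:
    """Forward pass returning (readout at the final position, cache of hidden states).

    If `patch` maps (layer, position) to a vector, that hidden state is
    overwritten right after the layer computes it, so later layers read the
    patched value.
    """
    patch = patch or {}
    states = [embed(tok) for tok in tokens]
    cache: Cache = {}
    for layer in range(n_layers):
        states = layer_step(states, layer)
        for pos in range(len(states)):
            if (layer, pos) in patch:
                states[pos] = list(patch[(layer, pos)])
            cache[(layer, pos)] = list(states[pos])
    return sum(states[-1]), cache


def patch_effect(source_tokens: List[int], target_tokens: List[int],
                 layer: int, src_pos: int, tgt_pos: int) -> float:
    """Change in the target readout when one source hidden state is patched in."""
    _, source_cache = run(source_tokens)
    baseline, _ = run(target_tokens)
    patched, _ = run(target_tokens,
                     patch={(layer, tgt_pos): source_cache[(layer, src_pos)]})
    return patched - baseline


# Stand-ins: a longer "CoT" sequence and a shorter "direct-answer" sequence.
source = [1, 2, 3, 4]
target = [5, 6]
early_effect = patch_effect(source, target, layer=0, src_pos=0, tgt_pos=0)
final_layer_effect = patch_effect(source, target, layer=2, src_pos=0, tgt_pos=0)
```

In this toy, a patch at an early layer propagates to the final position through the causal mixing, while a patch at the last layer of a non-final position cannot influence the readout. With a real model the same pattern is typically implemented with forward hooks on the layers of interest, and the readout would be replaced by generating an answer and scoring it.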
