CL AI Jul 24, 2025

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

arXiv:2507.22928v1 17.614 citations h-index: 28

Originality Highly original

AI Analysis

This work addresses whether CoT prompting is a faithful basis for interpretability in AI. It provides causal evidence that CoT induces more modular computation in high-capacity models, a contribution that is incremental but specific to mechanistic understanding.

The study investigated whether chain-of-thought (CoT) prompting reflects true internal reasoning in large language models. Swapping CoT-reasoning features into a noCoT run significantly raised answer log-probabilities in a 2.8B model but not in a 70M model, and confidence in correct answers improved from 1.2 to 4.3.

Chain-of-thought (CoT) prompting boosts the accuracy of large language models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in the 70M model, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
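
To make the feature-swapping idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the tiny random "sparse autoencoder", the ranking rule in `top_k_indices` (largest absolute CoT-versus-noCoT feature difference), and the linear `readout` standing in for an answer log-probability are all assumptions chosen for illustration. It shows the general shape of the procedure: encode activations into features, copy K features from the CoT run into the noCoT run, decode, and score the result for several values of K (a patch-curve), with a random-feature baseline for comparison.

```python
import random

def encode(act, W_enc, b_enc):
    # Toy SAE encoder: one ReLU unit per feature.
    return [max(0.0, sum(w * a for w, a in zip(row, act)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(feats, W_dec):
    # Toy SAE decoder: weighted sum of per-feature direction vectors.
    dim = len(W_dec[0])
    return [sum(f * row[j] for f, row in zip(feats, W_dec)) for j in range(dim)]

def top_k_indices(feats_cot, feats_nocot, k):
    # Assumed ranking rule: largest absolute CoT-vs-noCoT difference first.
    order = sorted(range(len(feats_cot)),
                   key=lambda i: abs(feats_cot[i] - feats_nocot[i]),
                   reverse=True)
    return order[:k]

def patch_features(feats_nocot, feats_cot, indices):
    # Copy the noCoT features, then overwrite the chosen ones with CoT values.
    patched = list(feats_nocot)
    for i in indices:
        patched[i] = feats_cot[i]
    return patched

def patch_curve(feats_nocot, feats_cot, W_dec, readout, ks, rng=None):
    # For each K, decode the patched features and score them with a linear
    # readout (a hypothetical stand-in for the answer log-probability).
    # If rng is given, patch K random features instead (the baseline).
    scores = []
    for k in ks:
        if rng is None:
            idx = top_k_indices(feats_cot, feats_nocot, k)
        else:
            idx = rng.sample(range(len(feats_cot)), k)
        act = decode(patch_features(feats_nocot, feats_cot, idx), W_dec)
        scores.append(sum(r * a for r, a in zip(readout, act)))
    return scores

# Demo on random toy data (no real model or dataset involved).
rng = random.Random(0)
DIM, N_FEAT = 4, 8
W_enc = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FEAT)]
b_enc = [0.0] * N_FEAT
W_dec = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FEAT)]
readout = [rng.gauss(0, 1) for _ in range(DIM)]
act_cot = [rng.gauss(0, 1) for _ in range(DIM)]
act_nocot = [rng.gauss(0, 1) for _ in range(DIM)]

f_cot = encode(act_cot, W_enc, b_enc)
f_nocot = encode(act_nocot, W_enc, b_enc)
ks = [0, 2, 4, N_FEAT]
top_curve = patch_curve(f_nocot, f_cot, W_dec, readout, ks)
rand_curve = patch_curve(f_nocot, f_cot, W_dec, readout, ks,
                         rng=random.Random(1))
print("top-K patch curve:   ", top_curve)
print("random patch baseline:", rand_curve)
```

The two curves coincide at K = 0 (nothing patched) and at K equal to the number of features (everything patched); what matters is how they differ in between. In this toy setting the comparison carries no empirical meaning, but it mirrors the kind of top-K versus random-feature comparison the abstract describes.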
