CL · Jun 5

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

arXiv:2602.11201 · 14.37 citations
Originality: Incremental advance
AI Analysis

For researchers using CoT explanations to interpret language models, this work provides a method to detect when explanations are unfaithful, revealing a critical limitation in current interpretability practices.

The paper proposes NLDD, a metric for measuring the faithfulness of chain-of-thought reasoning steps in language models. It finds that beyond 70-85% of chain length, reasoning tokens have little or negative effect on the final answer, which shows that accuracy alone does not indicate genuine reasoning.

Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model's confidence in its answer drops, which indicates whether a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
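The corrupt-and-measure idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the choice to normalize by the clean logit difference, the 0.05 threshold, and the example numbers are all assumptions made for illustration. The sketch takes as given a logit difference (correct answer versus an alternative) for the intact chain and for the chain with each step corrupted in turn.

```python
from typing import List, Sequence


def normalized_drops(clean_ld: float,
                     corrupted_lds: Sequence[float],
                     eps: float = 1e-8) -> List[float]:
    """Per-step drop in logit difference after corrupting that step.

    Assumption: drops are scaled by the magnitude of the clean logit
    difference so that values are comparable across models. The paper's
    exact normalization may differ.
    """
    scale = max(abs(clean_ld), eps)
    return [(clean_ld - ld) / scale for ld in corrupted_lds]


def reasoning_horizon(drops: Sequence[float], threshold: float = 0.05) -> int:
    """Number of leading steps after which no later step matters.

    Assumption: a step "matters" if its normalized drop exceeds
    `threshold`. Scanning from the end, trailing steps with small or
    negative drops are discarded; the count of remaining steps is
    returned as a hypothetical stand-in for k*.
    """
    k = len(drops)
    while k > 0 and drops[k - 1] <= threshold:
        k -= 1
    return k


# Made-up numbers: logit difference with the intact chain, then with
# each of five steps corrupted one at a time.
clean = 4.0
corrupted = [1.0, 2.0, 3.0, 3.9, 4.2]

drops = normalized_drops(clean, corrupted)
k_star = reasoning_horizon(drops)
print(f"horizon at step {k_star} of {len(drops)} "
      f"({k_star / len(drops):.0%} of chain length)")
```

In this toy input, corrupting either of the last two steps barely changes (or slightly raises) the logit difference, so the sketch places the horizon after the third step. A negative drop corresponds to the "negative effect" case mentioned above, where corrupting a step increases the model's confidence in its answer.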
