Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

arXiv:2605.2777382.0

Predicted impact: top 61% in CL (last 90 days) · Originality: Incremental advance

AI Analysis

For researchers and practitioners using chain-of-thought for interpretability, this work reveals that CoT explanations are often unfaithful to the model's decision process, highlighting the need for alternative monitoring methods.

The paper investigates whether chain-of-thought reasoning in language models faithfully reflects the mechanism behind their decisions when they face knowledge conflicts. It finds that the CoT text is largely invariant to the model's decision (flip pairs retain 96% of same-answer similarity), while self-rated confidence carries a weak but genuine predictive signal, suggesting that confidence, rather than the reasoning text, is the signal to monitor.

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.
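
To make the "flip pairs retain 96% of same-answer similarity" figure concrete, the following is a minimal sketch of how such a retention ratio and effect size could be computed from pairs of CoT texts. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not specify the tokenization, the primary similarity metric, or how pairs are formed, so the whitespace tokenizer, the ROUGE-L F1 variant, the pooled-variance Cohen's d, and the function names `rouge_l_f1`, `retention`, and `cohens_d` are all hypothetical choices made here.

```python
from statistics import mean, stdev


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on whitespace tokens (assumed tokenizer, not the paper's)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def retention(same_scores, flip_scores):
    """Mean flip-pair similarity as a fraction of mean same-answer similarity."""
    return mean(flip_scores) / mean(same_scores)


def cohens_d(xs, ys):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / pooled_var ** 0.5


if __name__ == "__main__":
    # Made-up similarity scores, purely for illustration.
    same = [0.52, 0.49, 0.50, 0.51]   # CoT pairs that reach the same answer
    flip = [0.50, 0.47, 0.48, 0.49]   # CoT pairs that reach opposite answers
    print("retention:", retention(same, flip))
    print("Cohen's d:", cohens_d(same, flip))
```

Under this reading, a retention near 1 means the reasoning text barely changes when the decision flips, which is the pattern the abstract describes as a decision-invariant knowledge display.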
