Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
This work examines how to evaluate the faithfulness of CoT explanations, cautioning against relying solely on hint-based metrics and advocating a broader interpretability toolkit.
The paper challenges the use of the Biasing Features metric to label Chain-of-Thought (CoT) reasoning as unfaithful, arguing that the metric confuses unfaithfulness with incompleteness. It shows that many CoTs flagged as unfaithful are judged faithful by other metrics, and that larger inference-time token budgets raise hint verbalization to as much as 90% in some settings, which suggests that much apparent unfaithfulness stems from tight token limits.
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features (more than 50% in some models) are judged faithful by other metrics. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
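To make the labeling rule concrete, the following is a minimal Python sketch of the Biasing Features criterion as the abstract states it: a CoT is flagged when the injected hint affected the prediction and the CoT omits the hint. The abstract does not define faithful@k, so the version here is an assumption: it treats an example as faithful@k if at least one of its first k sampled CoTs verbalizes the hint, with k standing in for a larger inference-time budget. The `Example` record, its field names, and the rule that "affected" means the answer switched to the hinted one are all hypothetical choices for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical record for one hinted question (field names are assumptions)."""
    pred_no_hint: str            # model answer on the original prompt
    pred_with_hint: str          # model answer on the prompt with the injected hint
    hint_answer: str             # answer the injected hint points to
    cot_verbalizes: List[bool]   # per sampled CoT: does it mention the hint?


def hint_affected(ex: Example) -> bool:
    """Assumed reading of 'affected': the answer switched to the hinted one."""
    return ex.pred_no_hint != ex.hint_answer and ex.pred_with_hint == ex.hint_answer


def biasing_features_unfaithful(ex: Example) -> bool:
    """Flag the first CoT as unfaithful if the hint affected the answer but is omitted."""
    return hint_affected(ex) and not ex.cot_verbalizes[0]


def faithful_at_k(examples: List[Example], k: int) -> float:
    """Assumed faithful@k: share of hint-affected examples where any of the
    first k sampled CoTs verbalizes the hint."""
    affected = [ex for ex in examples if hint_affected(ex)]
    if not affected:
        return 0.0
    hits = sum(any(ex.cot_verbalizes[:k]) for ex in affected)
    return hits / len(affected)


# Toy data, invented purely for illustration.
examples = [
    Example("A", "B", "B", [False, True, True]),
    Example("A", "B", "B", [False, False, False]),
    Example("A", "A", "B", [False, False, False]),  # hint had no effect
    Example("C", "D", "D", [True, False, False]),
]

if __name__ == "__main__":
    print([biasing_features_unfaithful(ex) for ex in examples])
    print(faithful_at_k(examples, 1), faithful_at_k(examples, 3))
```

In this toy data, the first example is flagged by the single-sample rule yet counts as faithful once more samples are considered, which mirrors the abstract's point that a tight budget can make an incomplete CoT look unfaithful.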