AI CLFeb 24

Counterfactual Simulation Training for Chain-of-Thought Faithfulness

arXiv:2602.20710v112.89 citationsh-index: 48Has Code

Originality Incremental advance

AI Analysis

This addresses the issue of unreliable reasoning insights from CoT for AI researchers and practitioners, though it is incremental as it builds on existing CoT monitoring methods.

The paper tackles the problem of Chain-of-Thought (CoT) faithfulness in large language models by introducing Counterfactual Simulation Training (CST), which improves monitor accuracy by 35 points and simulatability by 2 points in experiments with models up to 235B parameters.

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training

View on arXiv PDF Code

Similar