AI | LG | Sep 10, 2025

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness

Anand Swaroop, Akshat Nallani, Saksham Uboweja, Adiliia Uzdenova, Michael Nguyen, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma, Maheep Chaudhary

arXiv:2509.13334v1 | 4 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the critical gap between reasoning performance and trustworthiness in AI systems, which matters most to users who rely on interpretable outputs in complex tasks. The contribution is incremental, since it builds on existing CoT and alignment techniques.

The paper tackles the problem of unfaithful reasoning in chain-of-thought (CoT) prompting for large language models, where reasoning steps often fail to causally influence final answers, and introduces FRIT, a scalable alignment method that increases faithful reasoning by 3.4 percentage points and accuracy by 7.6 percentage points on GSM8K for Mistral-7B-v0.1.

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. Prior approaches focus primarily on measuring faithfulness, while methods for systematically improving it remain limited. We introduce Faithful Reasoning via Intervention Training (FRIT), a scalable alignment method that trains models to produce causally consistent reasoning by learning from systematically corrupted examples. FRIT generates synthetic training data by intervening on individual reasoning steps in model-generated CoTs, creating faithful/unfaithful pairs that highlight when reasoning breaks down. We then apply Direct Preference Optimization to teach models to prefer causally consistent reasoning paths. Evaluated on Qwen3-8B and Mistral-7B-v0.1 across factual and symbolic reasoning tasks, FRIT increases faithful reasoning by 3.4 percentage points for Mistral on GSM8K while improving accuracy by 7.6 percentage points. Our approach provides the first scalable, supervision-free method for training language models to produce more reliable and interpretable reasoning, addressing a critical gap between reasoning performance and trustworthiness. We release our code at https://github.com/Anut-py/frit.
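
The two ingredients named in the abstract, step-level interventions and Direct Preference Optimization, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The helpers `causal_flags`, `toy_answer`, and `bump_number` are hypothetical stand-ins, and the rule "a step is causally important if corrupting it changes the final answer" is an assumption made for illustration. Only the per-pair DPO loss follows the standard published formula.

```python
import math
from typing import Callable, List


def causal_flags(
    steps: List[str],
    corrupt: Callable[[str], str],
    answer_fn: Callable[[List[str]], int],
) -> List[bool]:
    """For each step, report whether corrupting it changes the final answer.

    Assumption: this stands in for the intervention described in the abstract;
    the paper's actual corruption and scoring procedure may differ.
    """
    original = answer_fn(steps)
    flags = []
    for i, step in enumerate(steps):
        intervened = steps[:i] + [corrupt(step)] + steps[i + 1:]
        flags.append(answer_fn(intervened) != original)
    return flags


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical toy "model": the answer is the sum of all "add N" steps,
# while "note N" steps are ignored and therefore have no causal effect.
def toy_answer(steps: List[str]) -> int:
    return sum(int(s.split()[1]) for s in steps if s.startswith("add"))


# Hypothetical corruption: increment the number in a step by one.
def bump_number(step: str) -> str:
    kind, value = step.split()
    return f"{kind} {int(value) + 1}"


flags = causal_flags(["add 2", "note 7", "add 3"], bump_number, toy_answer)
```

In this toy setting the middle step is flagged as causally irrelevant, so a chain containing it could serve as the "unfaithful" side of a preference pair, with a chain whose steps all matter on the "faithful" side. `dpo_loss` then shrinks as the policy assigns relatively more probability to the faithful chain than the reference model does.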
