Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity
This work addresses trustworthiness issues in high-stakes domains by making model reasoning more reliable, though it is an incremental improvement over existing fine-tuning and process supervision methods.
The paper tackles unfaithful reasoning in large language models, where correct answers may arise from flawed reasoning traces. It introduces Counterfactual Sensitivity Regularization (CSR) to enforce a causal-like dependence between outputs and intermediate steps, and reports faithfulness improvements of up to 70 percentage points across benchmarks like GSM8K and HotpotQA.
The reasoning processes of large language models often lack faithfulness; a model may generate a correct answer while relying on a flawed or irrelevant reasoning trace. This behavior, a direct consequence of training objectives that solely reward final-answer correctness, severely undermines the trustworthiness of these models in high-stakes domains. This paper introduces \textbf{Counterfactual Sensitivity Regularization (CSR)}, a novel training objective designed to forge a strong, causal-like dependence between a model's output and its intermediate reasoning steps. During training, CSR performs automated, operator-level interventions on the generated reasoning trace (e.g., swapping ``+'' with ``-'') to create a minimally perturbed counterfactual. A regularization term then penalizes the model if this logically flawed trace still yields the original answer. Our efficient implementation adds only 8.7\% training overhead through a warm-start curriculum and token-subset optimization. We evaluate faithfulness using \textbf{Counterfactual Outcome Sensitivity (COS)}, a metric quantifying how sensitive the final answer is to such logical perturbations. Across diverse structured reasoning benchmarks -- arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop QA (HotpotQA), and code generation (MBPP) -- models trained with CSR demonstrate a vastly superior trade-off between accuracy and faithfulness. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, with this learned sensitivity generalizing to larger models and enhancing the performance of inference-time techniques like self-consistency.
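To make the operator-level intervention and the sensitivity measurement concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy setting in which a reasoning trace is a short arithmetic string and the "model" is any function mapping a trace to an answer. The swap table, the helper names (`perturb_first_operator`, `counterfactual_outcome_sensitivity`, `faithful_answer`), and the definition of COS as the fraction of perturbed traces whose answer changes are all illustrative assumptions; the abstract does not give the exact formula.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical swap table: the abstract only mentions "+" -> "-" as an example.
OPERATOR_SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}


def perturb_first_operator(trace: str) -> Optional[str]:
    """Swap the first space-delimited arithmetic operator in the trace.

    Returns None when the trace contains no operator to intervene on.
    """
    match = re.search(r"(?<= )[+\-*/](?= )", trace)
    if match is None:
        return None
    swapped = OPERATOR_SWAPS[match.group(0)]
    return trace[: match.start()] + swapped + trace[match.end():]


def counterfactual_outcome_sensitivity(
    traces: Iterable[str], answer_fn: Callable[[str], object]
) -> float:
    """Assumed COS: share of perturbable traces whose answer changes."""
    changed = 0
    total = 0
    for trace in traces:
        perturbed = perturb_first_operator(trace)
        if perturbed is None:
            continue  # nothing to intervene on, so skip this trace
        total += 1
        if answer_fn(perturbed) != answer_fn(trace):
            changed += 1
    return changed / total if total else 0.0


def faithful_answer(trace: str) -> int:
    """Toy 'model' whose answer is computed from the trace itself."""
    left, op, right = trace.split()
    a, b = int(left), int(right)
    return {"+": a + b, "-": a - b, "*": a * b, "/": a // b}[op]


def unfaithful_answer(trace: str) -> int:
    """Toy 'model' that ignores the trace and always answers 7."""
    return 7


traces = ["3 + 4", "10 - 2"]
cos_faithful = counterfactual_outcome_sensitivity(traces, faithful_answer)
cos_unfaithful = counterfactual_outcome_sensitivity(traces, unfaithful_answer)
```

In this toy setting the trace-dependent answerer changes its answer under every perturbation, while the constant answerer never does. The latter is the behavior that a CSR-style penalty is meant to discourage: a logically flawed trace that still yields the original answer.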