How Does Unfaithful Reasoning Emerge from Autoregressive Training? A Study of Synthetic Experiments
The paper addresses unreliable reasoning in AI systems, a concern for both researchers and practitioners. Its contribution is incremental: it follows up existing empirical observations of unfaithful reasoning with controlled experiments.
The paper investigates how unfaithful chain-of-thought reasoning emerges in language models, using synthetic experiments on modular arithmetic. It finds that faithful reasoning is learned only when training noise is below a critical threshold, which the authors attribute to simplicity bias; at higher noise, models shift to skip-step reasoning, passing through a phase with a transient increase in prediction entropy.
Chain-of-thought (CoT) reasoning generated by large language models (LLMs) is often unfaithful: intermediate steps can be logically inconsistent or fail to reflect the causal relationship leading to the final answer. Despite extensive empirical observations, a fundamental understanding of CoT is lacking: what constitutes faithful CoT reasoning, and how does unfaithfulness emerge from autoregressive training? We study these questions using well-controlled synthetic experiments, training small transformers on noisy data to solve modular arithmetic expressions step by step, a task we term Arithmetic Expression Reasoning. We find that models can learn faithful reasoning that causally follows the underlying arithmetic rules, but only when the training noise is below a critical threshold, a phenomenon attributable to simplicity bias. At higher noise levels, training dynamics exhibit a transition from faithful stepwise reasoning to unfaithful skip-step reasoning via an intermediate mixed mode characterized by a transient increase in prediction entropy. Mechanistic analysis reveals that models learn to encode internal uncertainty by resolving inconsistent reasoning steps, which suggests the emergence of implicit self-verification from autoregressive training.
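To make the setup concrete, the following is a minimal sketch of how noisy stepwise training data for a modular arithmetic task might be generated, together with the Shannon entropy of a predictive distribution. The abstract does not specify the expression format, the noise model, or the entropy estimator, so everything here is an assumption for illustration: the left-to-right evaluation order, the choice to corrupt only intermediate steps with uniformly random residues, and the helper names reduce_steps, corrupt, and entropy are all hypothetical.

```python
import math
import random


def reduce_steps(operands, ops, modulus):
    """Evaluate an expression left to right, keeping every intermediate value.

    Assumption: left-to-right evaluation with +, -, * over residues mod `modulus`.
    """
    value = operands[0] % modulus
    steps = [value]
    for op, operand in zip(ops, operands[1:]):
        if op == "+":
            value = (value + operand) % modulus
        elif op == "-":
            value = (value - operand) % modulus
        elif op == "*":
            value = (value * operand) % modulus
        else:
            raise ValueError(f"unsupported operator: {op!r}")
        steps.append(value)
    return steps


def corrupt(steps, noise, modulus, rng):
    """Return a copy of `steps` with noisy intermediate values.

    Assumption (hypothetical noise model): each intermediate step is replaced,
    independently with probability `noise`, by a uniformly random residue.
    The first value and the final answer are left untouched.
    """
    noisy = list(steps)
    for i in range(1, len(steps) - 1):
        if rng.random() < noise:
            noisy[i] = rng.randrange(modulus)
    return noisy


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = reduce_steps([5, 4, 3], ["+", "*"], modulus=7)
    print("clean chain:", clean)
    print("noisy chain:", corrupt(clean, noise=0.5, modulus=7, rng=rng))
    print("entropy of a uniform guess over 7 residues:", entropy([1 / 7] * 7))
```

Under this assumed noise model, a chain whose intermediate steps are corrupted still ends in the correct answer, which is one way the stepwise path could become less predictive than jumping directly to the result. Whether this matches the noise process used in the paper cannot be determined from the abstract alone.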