On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
This work provides a theoretical foundation for understanding when and why CoT helps or hurts in reasoning tasks.
The paper develops a learning-theoretic framework for Chain of Thought (CoT), decomposing its risk into oracle-trajectory and trajectory-mismatch terms. It shows that without stability, CoT's cost can be arbitrarily large, while under stability, error growth is bounded and characterized by an amplification factor.
We develop a learning-theoretic framework for understanding Chain of Thought (CoT). We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. We then show that this cost is unavoidable without structure: if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Conversely, under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes. Together, these results give a precise theory of when CoT helps, when it hurts, and what controls the transition between the two.
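To make the three error-growth regimes concrete, here is a minimal numerical sketch. It is not the paper's construction: it assumes a simple scalar recursion `e_{t+1} = L * e_t + eps`, where `L` is a hypothetical stability (Lipschitz-type) constant of the chain and `eps` is a hypothetical per-step error. Under that assumption, the accumulated error after `T` steps is `eps` times the geometric sum `1 + L + ... + L^(T-1)`, which stays bounded for `L < 1`, grows linearly for `L = 1`, and grows exponentially for `L > 1`. The abstract's exact amplification factor may take a different form.

```python
def amplification(L: float, T: int) -> float:
    """Geometric sum 1 + L + ... + L**(T-1).

    Illustrative stand-in for an amplification factor, assuming the
    recursion e_{t+1} = L * e_t + eps with e_0 = 0 (an assumption made
    for this sketch, not a formula taken from the paper).
    """
    if L == 1.0:
        return float(T)
    return (L**T - 1.0) / (L - 1.0)


def accumulate(L: float, eps: float, T: int) -> float:
    """Iterate the assumed recursion for T steps, starting from zero error."""
    e = 0.0
    for _ in range(T):
        e = L * e + eps
    return e


if __name__ == "__main__":
    eps = 0.01  # hypothetical per-step error
    for L in (0.5, 1.0, 2.0):  # contractive, neutral, expansive chains
        for T in (5, 10, 20):
            print(f"L={L:<4} T={T:<3} error={accumulate(L, eps, T):.4f}")
```

In this toy model the closed form and the iteration agree, so the regime is determined entirely by whether `L` is below, equal to, or above 1.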