Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces

William L. Tong, Ege Cakar, Cengiz Pehlevan

arXiv:2602.14404v1 · 4.41 citations

Originality: Incremental advance

AI Analysis

This work identifies fundamental benefits and limitations of reasoning traces for improving robust reasoning in AI, though it is incremental in advancing understanding of task topology effects.

The study investigates how reasoning traces (RTs) affect length generalization in neural networks trained on logical reasoning tasks, using a large-scale dataset (PITA) and a synthetic syllogism task. RT models generalize well on broad, shallow tasks but deteriorate on narrow, deep ones relative to non-RT baselines.

Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remains incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to a fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.
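To make the notions of depth, breadth, and length generalization concrete, here is a minimal Python sketch on a toy implication-chain task. It is an illustration only, not the PITA dataset or the authors' syllogism task: the function names (`make_example`, `proof_depth`, `task_breadth`, `length_generalization_split`), the chain encoding, and the way breadth is counted are all assumptions made for this example.

```python
import math
import random


def make_example(depth, n_symbols, rng):
    """Build a toy example: a chain of `depth` implications over distinct
    symbols, plus a query asking whether the first symbol implies the last.
    (Hypothetical encoding, assumed for illustration.)"""
    symbols = rng.sample(range(n_symbols), depth + 1)
    premises = [(symbols[i], symbols[i + 1]) for i in range(depth)]
    rng.shuffle(premises)  # premise order should not leak the proof
    query = (symbols[0], symbols[-1])
    return premises, query


def proof_depth(premises, query):
    """Number of chaining steps needed to derive `query` from `premises`,
    or None if the query is not derivable. Assumes each symbol has at most
    one outgoing implication, as in `make_example`."""
    successor = dict(premises)
    node, steps = query[0], 0
    while node != query[1]:
        if node not in successor or steps > len(premises):
            return None
        node = successor[node]
        steps += 1
    return steps


def task_breadth(depth, n_symbols):
    """Count distinct chains of a given depth: ordered choices of
    depth + 1 distinct symbols (premise order is ignored)."""
    return math.perm(n_symbols, depth + 1)


def length_generalization_split(max_train_depth, max_test_depth,
                                per_depth, n_symbols, seed=0):
    """Train on examples with depth <= max_train_depth and test on
    strictly deeper examples, up to max_test_depth."""
    rng = random.Random(seed)
    train, test = [], []
    for depth in range(1, max_test_depth + 1):
        bucket = train if depth <= max_train_depth else test
        for _ in range(per_depth):
            bucket.append(make_example(depth, n_symbols, rng))
    return train, test


if __name__ == "__main__":
    train, test = length_generalization_split(3, 6, per_depth=5, n_symbols=20)
    print(len(train), len(test))
    print(task_breadth(depth=2, n_symbols=4))
```

In this toy setting, raising `n_symbols` widens the task (more unique examples at each depth), while raising `max_test_depth` relative to `max_train_depth` makes the generalization gap deeper.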
