LGAIMLJun 10, 2024

How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

arXiv:2406.06467v343 citations
Originality Incremental advance
AI Analysis

This addresses the problem of enabling Transformers to perform complex reasoning tasks more efficiently and generalize better, which is crucial for AI systems requiring logical inference, though it is incremental in proposing specific techniques within existing frameworks.

The paper investigates the limitations of Transformers in learning to compose reasoning steps, such as syllogisms, by introducing the 'globality degree' concept to show that high-globality distributions cannot be learned efficiently, and proposes 'inductive scratchpads' that break this barrier and improve out-of-distribution generalization, achieving up to 6× length generalization in arithmetic tasks.

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by $TC^0/TC^1$ classes (further studied here), since the globality relates to correlations with the more limited $NC^0$ class. We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of 'inductive scratchpad', that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. In particular, some of our inductive scratchpads can achieve length generalizations of up to $6\times$ for some arithmetic tasks depending on the input formatting.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes