Context-Free Recognition with Transformers

Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

AI2ETH Zurich

arXiv:2601.01754v21.42 citationsh-index: 11

Originality Highly original

AI Analysis

This addresses a foundational theoretical gap in understanding transformers' capabilities for syntax processing, with implications for AI and computational linguistics, though it is incremental in refining prior work on regular languages.

The paper tackles the problem of whether transformers can recognize context-free languages (CFLs), showing that looped transformers with O(log(n)) layers and O(n^6) padding tokens can recognize all CFLs, but this is impractical; for unambiguous CFLs, it reduces to O(n^3) padding, making it more tractable.

Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work proves that $\mathcal{O}(\log(n))$ looping layers (w.r.t. input length n) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(n))$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $\mathcal{O}(n^6)$ padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(n^3)$ padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

View on arXiv PDF

Similar