CLAILGJul 21, 2024

When Can Transformers Count to n?

DeepMind
arXiv:2407.15160v321 citationsh-index: 32
Originality Incremental advance
AI Analysis

This work addresses a critical blind spot in understanding transformer limitations for simple algorithmic problems, which is incremental but important for model design and evaluation.

The paper investigates the fundamental limitations of transformer-based language models on basic counting tasks, revealing a sharp theoretical phase transition where exact counting becomes unlearnable when vocabulary size exceeds embedding dimension, and empirically validates this bottleneck with performance drops at the threshold.

Large language models based on the transformer architecture can solve highly complex tasks, yet their fundamental limitations on simple algorithmic problems remain poorly understood. In this work, we focus on basic counting tasks and investigate how the difficulty of these tasks scales with the transformer embedding dimension, the context length, and the vocabulary size. We reveal a sharp theoretical phase transition governed by the relationship between the embedding dimension and the vocabulary size. When the dimension is at least as large as the vocabulary, transformers can perfectly maintain token counts. However, when the vocabulary exceeds the embedding dimension, the interference between non-orthogonal token representations forces the network weights to scale polynomially. This renders the exact counting algorithm numerically unstable and practically unlearnable. We empirically validate this bottleneck by training transformers from scratch, demonstrating a strict performance drop at the theoretical threshold and catastrophic out of distribution failure when scaling the vocabulary or context length. Furthermore, we show that state-of-the-art pretrained models suffer from similar failure cases. Our work reveals a critical blind spot absent from the current literature regarding the connection among these three parameters, proving that vocabulary size fundamentally dictates the difficulty of counting tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes