CL AI LGJul 21, 2024

When Can Transformers Count to n?

Gilad Yehudai, Haim Kaplan, Guy Dar, Royi Rassin, Asma Ghandeharioun, Mor Geva, Amir Globerson

DeepMind

arXiv:2407.15160v311.521 citationsh-index: 32

Originality Incremental advance

AI Analysis

This work addresses a critical blind spot in understanding transformer limitations for simple algorithmic problems, which is incremental but important for model design and evaluation.

The paper investigates the fundamental limitations of transformer-based language models on basic counting tasks, revealing a sharp theoretical phase transition where exact counting becomes unlearnable when vocabulary size exceeds embedding dimension, and empirically validates this bottleneck with performance drops at the threshold.

Large language models based on the transformer architecture can solve highly complex tasks, yet their fundamental limitations on simple algorithmic problems remain poorly understood. In this work, we focus on basic counting tasks and investigate how the difficulty of these tasks scales with the transformer embedding dimension, the context length, and the vocabulary size. We reveal a sharp theoretical phase transition governed by the relationship between the embedding dimension and the vocabulary size. When the dimension is at least as large as the vocabulary, transformers can perfectly maintain token counts. However, when the vocabulary exceeds the embedding dimension, the interference between non-orthogonal token representations forces the network weights to scale polynomially. This renders the exact counting algorithm numerically unstable and practically unlearnable. We empirically validate this bottleneck by training transformers from scratch, demonstrating a strict performance drop at the theoretical threshold and catastrophic out of distribution failure when scaling the vocabulary or context length. Furthermore, we show that state-of-the-art pretrained models suffer from similar failure cases. Our work reveals a critical blind spot absent from the current literature regarding the connection among these three parameters, proving that vocabulary size fundamentally dictates the difficulty of counting tasks.

View on arXiv PDF

Similar