LG · Jul 16, 2024

Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

arXiv:2407.11542v3 · 5 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work helps researchers understand how architectural design choices shape transformers, but the advance is incremental: it studies a basic task under minor modifications.

The paper investigates how small transformers solve the histogram task (counting items in sequences), identifying two strategies—relation-based and inventory-based counting—that distribute functionality between attention and feed-forward layers, with empirical results confirming these regimes and showing robustness improvements from design tweaks like softmax and special tokens.
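To make the histogram task concrete, here is a minimal Python sketch of the input-output mapping: each position is labeled with the number of times its token occurs in the whole sequence. The function name `histogram_task` and the example sequence are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter


def histogram_task(tokens):
    """Label each position with how often its token occurs in the sequence.

    Hypothetical reference implementation of the task's target function;
    the name and signature are assumptions made for illustration.
    """
    counts = Counter(tokens)  # one tally per distinct token
    return [counts[t] for t in tokens]


example = ["a", "b", "a", "c", "a", "b"]
print(histogram_task(example))  # [3, 2, 3, 1, 3, 2]
```

A model trained on this task has to reproduce these per-position counts from the token sequence alone.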

Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
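As a rough intuition for the two strategy names, the NumPy sketch below computes the same counts in two ways. It rests on assumptions that are not from the paper: tokens are one-hot encoded, "relation-based" is read as comparing every pair of tokens and summing the matches, and "inventory-based" is read as first building one tally per vocabulary item and then looking up the current token's entry. The function names are hypothetical, and the sketch does not reproduce the paper's actual constructions or trained models.

```python
import numpy as np


def one_hot(tokens, vocab_size):
    """Encode integer tokens as one-hot rows of shape (len(tokens), vocab_size)."""
    x = np.zeros((len(tokens), vocab_size))
    x[np.arange(len(tokens)), tokens] = 1.0
    return x


def relation_based_counts(x):
    """Assumed reading: compare all token pairs, then sum matches per position."""
    same = x @ x.T  # (L, L): entry is 1 where two positions hold the same token
    return same.sum(axis=1)


def inventory_based_counts(x):
    """Assumed reading: tally the vocabulary once, then look up each token's tally."""
    inventory = x.sum(axis=0)  # (vocab_size,): occurrences of each vocabulary item
    return x @ inventory  # one-hot rows select the matching tally


tokens = [0, 1, 0, 2, 0, 1]
x = one_hot(tokens, vocab_size=3)
print(relation_based_counts(x))
print(inventory_based_counts(x))
```

Both functions return the same counts here. The difference lies in where the work happens: the first spends it on pairwise comparison, the second on a per-vocabulary summary followed by a lookup, which loosely mirrors the abstract's point that the strategy determines how functionality is split between token mixing and the feed-forward layer.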

Code Implementations: 1 repo
