CL · LG · Feb 3, 2025

Emergent Stack Representations in Modeling Counter Languages Using Transformers

arXiv:2502.01432v1 · 2 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work offers an incremental advance in understanding the inner workings of transformers, relevant to researchers in machine learning and formal language theory.

The researchers studied how transformers learn by training them on counter languages, which can be modeled with stacks. Probing the models' internal representations for stack depths showed that transformers trained to predict the next token learn stack-like representations.

Transformer architectures are the backbone of most modern language models, but understanding their inner workings remains largely an open problem. One way past research has tackled this problem is by isolating the learning capabilities of these architectures, training them on well-understood classes of formal languages. We extend this literature by analyzing models trained on counter languages, which can be modeled using counter variables. We train transformer models on four counter languages and equivalently formulate these languages using stacks, whose depths correspond to the counter values. We then probe the models' internal representations for the stack depth at each input token and show that, when trained as next-token predictors, these models learn stack-like representations. This brings us closer to understanding the algorithmic details of how transformers learn languages and aids circuit discovery.
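The correspondence between counters and stack depths can be made concrete with a small sketch. The abstract does not name the four languages, so the code below uses Dyck-1 (balanced parentheses, one counter) and a^n b^n c^n (two counters) purely as assumed stand-ins. Each counter is read as the depth of its own stack: an increment is a push and a decrement is a pop. The per-token depths it produces are the kind of targets a probe would be trained to predict from hidden states. The function names are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple


def dyck1_stack_depths(tokens: str) -> Optional[List[int]]:
    """Stack depth after each token of a Dyck-1 string, or None if invalid.

    Assumed example language: '(' increments the single counter (push),
    ')' decrements it (pop). The stack depth equals the counter value.
    """
    depth = 0
    depths: List[int] = []
    for tok in tokens:
        if tok == "(":
            depth += 1                 # push
        elif tok == ")":
            if depth == 0:             # pop from an empty stack: reject
                return None
            depth -= 1                 # pop
        else:
            raise ValueError(f"unexpected token {tok!r}")
        depths.append(depth)
    # A complete word must leave the stack empty (counter back at zero).
    return depths if depth == 0 else None


def anbncn_stack_depths(tokens: str) -> Optional[List[Tuple[int, int]]]:
    """Depths of two stacks after each token of a^n b^n c^n, or None.

    Assumed example language: stack 1 counts unmatched a's; each b pops
    stack 1 and pushes stack 2; each c pops stack 2.
    """
    c1 = c2 = 0
    phase = 0                          # 0: reading a's, 1: b's, 2: c's
    depths: List[Tuple[int, int]] = []
    for tok in tokens:
        if tok == "a":
            if phase != 0:             # an 'a' after any b or c: reject
                return None
            c1 += 1
        elif tok == "b":
            if phase == 2 or c1 == 0:  # b after c, or more b's than a's
                return None
            phase = 1
            c1 -= 1
            c2 += 1
        elif tok == "c":
            if c1 != 0 or c2 == 0:     # a's not all matched, or too many c's
                return None
            phase = 2
            c2 -= 1
        else:
            raise ValueError(f"unexpected token {tok!r}")
        depths.append((c1, c2))
    return depths if c1 == 0 and c2 == 0 else None


if __name__ == "__main__":
    print(dyck1_stack_depths("(()())"))   # [1, 2, 1, 2, 1, 0]
    print(anbncn_stack_depths("aabbcc"))  # [(1, 0), (2, 0), (1, 1), (0, 2), (0, 1), (0, 0)]
```

Returning `None` for strings outside the language keeps the labeling tied to valid training strings; whether the paper labels prefixes of invalid strings, or uses classification rather than regression targets, is not stated here.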
