LG, AI | Feb 26

Transformers converge to invariant algorithmic cores

arXiv:2602.22600v1 | h-index: 7

Originality: Highly original

AI Analysis

This work is significant for researchers in mechanistic interpretability, as it proposes targeting invariant computational essences rather than implementation-specific details to better understand transformer computations.

This paper tackles the problem of understanding how large language models work internally by extracting "algorithmic cores": compact subspaces that are necessary and sufficient for task performance. The authors find that independently trained transformers converge to the same cores, and that these cores reveal low-dimensional invariants that persist across training runs and scales.

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.
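The claim that different weights can share the same core may be easier to picture with a toy linear-algebra sketch. The example below is not the paper's extraction procedure; it only illustrates, under simplifying assumptions, how one 3D operator placed in two different random subspaces of a larger space yields different matrices with the same nonzero spectrum. The transition matrix, the ambient dimension of 64, and the helper names (`random_subspace`, `embed`, `top_spectrum`) are all hypothetical choices made for illustration.

```python
import numpy as np


def random_subspace(dim, k, rng):
    """Return a dim x k matrix with orthonormal columns spanning a random subspace."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, k)))
    return q


def embed(core, q):
    """Place a k x k operator into the ambient space along the subspace spanned by q."""
    return q @ core @ q.T


def top_spectrum(op, k):
    """Return the k largest eigenvalues of a symmetric operator, in ascending order."""
    return np.linalg.eigvalsh(op)[-k:]


rng = np.random.default_rng(0)

# Hypothetical symmetric 3-state transition matrix (rows and columns sum to 1).
# Symmetry is an assumption that keeps the eigenvalues real and easy to compare.
core = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
])

# Two stand-ins for "independently trained" models: same core, different subspaces.
q_a = random_subspace(64, 3, rng)
q_b = random_subspace(64, 3, rng)
op_a = embed(core, q_a)
op_b = embed(core, q_b)

# Frobenius norm of the cross-projection; small values mean nearly orthogonal subspaces.
overlap = np.linalg.norm(q_a.T @ q_b)

same_weights = np.allclose(op_a, op_b)
same_spectrum = np.allclose(top_spectrum(op_a, 3), top_spectrum(op_b, 3))
print(f"subspace overlap: {overlap:.3f}")
print(f"same weights: {same_weights}, same spectrum: {same_spectrum}")
```

In this toy setting the two ambient matrices differ entry by entry, while their three leading eigenvalues agree with those of `core`. That is the sense in which a spectrum can serve as an implementation-independent invariant.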
