LG AI · Apr 16

Stability and Generalization in Looped Transformers

arXiv:2604.15259 · 57.33 citations · h-index: 1

Predicted impact: top 41% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers designing looped transformer architectures, this work offers theoretical and empirical guidance on achieving test-time scaling. The advance is incremental: it builds on existing concepts of fixed-point iteration and normalization.

It remains unclear which architectural choices let looped transformers generalize to harder problems at test time rather than memorize training-specific solutions. The paper introduces a fixed-point framework that analyzes stability along three axes (reachability, input-dependence, geometry) and proves that recall combined with outer normalization yields fixed points that are reachable, locally smooth in the input, and supported by stable backpropagation. In experiments on chess, sudoku, and prefix-sums, downstream performance tracks the framework's predictions.

Looped transformers promise test-time compute scaling by spending more iterations on harder problems, but it remains unclear which architectural choices let them extrapolate to harder problems at test time rather than memorize training-specific solutions. We introduce a fixed-point based framework for analyzing looped architectures along three axes of stability -- reachability, input-dependence, and geometry -- and use it to characterize when fixed-point iteration yields meaningful predictions. Theoretically, we prove that looped networks without recall have countable fixed points and cannot achieve strong input-dependence at any spectral regime, while recall combined with outer normalization reliably produces a regime in which fixed points are simultaneously reachable, locally smooth in the input, and supported by stable backpropagation. Empirically, we train single-layer looped transformers on chess, sudoku, and prefix-sums and find that downstream performance tracks the framework's predictions across tasks and architectural configurations. We additionally introduce internal recall, a novel recall placement variant, and show that it becomes competitive with -- and on sudoku, substantially better than -- standard recall placement once outer normalization is applied.
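The role of recall in the abstract above can be illustrated with a toy fixed-point iteration. The following is a minimal sketch under stated assumptions, not the paper's architecture: the 2x2 weight matrix `W`, the tanh `block`, the RMS-style `rms_normalize`, and the helper names are all hypothetical, and "recall" is modeled simply as re-adding the input at every iteration, with "outer normalization" modeled as normalizing the state after that addition.

```python
import math

# Hypothetical toy weights; Frobenius norm < 1, so the tanh block is a contraction.
W = [[0.3, -0.2], [0.1, 0.4]]

def block(h):
    """Toy stand-in for the looped layer: tanh of a linear map."""
    return [math.tanh(sum(w * v for w, v in zip(row, h))) for row in W]

def rms_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(a * a for a in v) / len(v)) + eps
    return [a / rms for a in v]

def run_loop(x, steps=60, recall=False, outer_norm=False):
    """Iterate the block from initial state x.

    recall:     re-inject the input x at every step (assumed form of recall).
    outer_norm: normalize the state after each step (assumed form of outer normalization).
    """
    h = list(x)  # without recall, x only sets the initial state
    for _ in range(steps):
        z = block(h)
        if recall:
            z = [a + b for a, b in zip(z, x)]
        if outer_norm:
            z = rms_normalize(z)
        h = z
    return h

def distance(a, b):
    """Euclidean distance between two states."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

x1, x2 = [1.0, 0.0], [0.0, 1.0]
gap_no_recall = distance(run_loop(x1), run_loop(x2))
gap_recall = distance(run_loop(x1, recall=True), run_loop(x2, recall=True))
print("final-state gap without recall:", gap_no_recall)
print("final-state gap with recall:   ", gap_recall)
```

In this toy setting the loop without recall contracts every input toward the same state, so the final state carries no information about the input, whereas the loop with recall settles at states that differ between inputs. The sketch only illustrates the intuition behind input-dependence; it makes no claim about the convergence behavior of the normalized variant or about the paper's actual models.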

