LG · AI · Apr 3, 2025

On Vanishing Variance in Transformer Length Generalization

Oxford
arXiv:2504.02827v1 · 4 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses a critical limitation of Transformer models in applications that require robust reasoning over varying sequence lengths, though it is an incremental improvement focused on specific tasks.

The paper tackles the failure of Transformers to generalize to longer sequences after training on shorter ones. It shows that longer sequences cause the variance of attention outputs to vanish, and that applying layer normalization after attention improves length generalization on tasks such as argmax retrieval and dictionary lookup.

It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
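
The vanishing variance effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a single attention head with i.i.d. standard Gaussian queries, keys, and values, and the function names (`attention_output`, `mean_output_variance`, `layer_norm`) are hypothetical helpers introduced only for this illustration. Under these assumptions the attention output is a weighted average of more and more independent value vectors as the sequence grows, so its variance across features shrinks. Layer normalization without learned parameters then rescales the output back to roughly unit variance.

```python
import numpy as np

def attention_output(seq_len, d_model, rng):
    """Single-head softmax attention output for one query over seq_len random tokens.

    Assumption: queries, keys, and values are i.i.d. standard Gaussian (toy setting).
    """
    q = rng.standard_normal(d_model)
    k = rng.standard_normal((seq_len, d_model))
    v = rng.standard_normal((seq_len, d_model))
    scores = k @ q / np.sqrt(d_model)        # scaled dot-product scores, shape (seq_len,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v                       # weighted average of values, shape (d_model,)

def mean_output_variance(seq_len, d_model=64, n_trials=200, seed=0):
    """Variance across features of the attention output, averaged over random trials."""
    rng = np.random.default_rng(seed)
    variances = [attention_output(seq_len, d_model, rng).var() for _ in range(n_trials)]
    return float(np.mean(variances))

def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature axis, without learned scale or shift."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

if __name__ == "__main__":
    # Output variance shrinks as the sequence gets longer.
    for n in (16, 128, 1024):
        print(f"seq_len={n:5d}  mean output variance={mean_output_variance(n):.5f}")

    # Layer normalization restores roughly unit variance for a long sequence.
    rng = np.random.default_rng(1)
    normalized = layer_norm(attention_output(1024, 64, rng))
    print(f"variance after layer norm: {normalized.var():.3f}")
```

Rescaling the variance in this way only addresses one aspect of the shift between short and long sequences, which is consistent with the abstract's statement that layer normalization reduces the distribution shift without eliminating it.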
