LG AI

Apr 3, 2025

On Vanishing Variance in Transformer Length Generalization

Ruining Li, Gabrijel Boduljak, Jensen (Jinghao) Zhou

Oxford

arXiv:2504.02827v1

13.06 citations

h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a critical limitation of Transformer models in AI applications that require robust reasoning over varying sequence lengths, though it is an incremental improvement focused on specific tasks.

The paper tackles the failure of Transformers to generalize to sequences longer than those seen in training. It shows that longer sequences cause the variance of attention outputs to vanish, and that applying layer normalization after attention improves length generalization on tasks such as argmax retrieval and dictionary lookup.
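
One way to see why the variance can shrink is the following back-of-the-envelope calculation. It is an illustrative simplification, not necessarily the paper's own derivation: it assumes fixed attention weights $\alpha_i$ and value vectors $v_i$ whose components are i.i.d. with zero mean and variance $\sigma^2$ (the symbols $\alpha_i$, $v_i$, $o$, $\sigma^2$, and $N$ are introduced here for illustration).

```latex
\[
o = \sum_{i=1}^{N} \alpha_i v_i, \qquad \alpha_i \ge 0, \qquad \sum_{i=1}^{N} \alpha_i = 1
\]
\[
\operatorname{Var}(o_j) = \sum_{i=1}^{N} \alpha_i^2 \operatorname{Var}(v_{i,j}) = \sigma^2 \sum_{i=1}^{N} \alpha_i^2,
\qquad
\frac{\sigma^2}{N} \le \operatorname{Var}(o_j) \le \sigma^2
\]
```

The lower bound is attained by uniform weights $\alpha_i = 1/N$ and the upper bound by one-hot weights. Under these assumptions, the more evenly attention spreads over a long sequence, the closer the output variance gets to $\sigma^2 / N$.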

It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
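
The effect described in the abstract can be reproduced in a toy setting. The sketch below is not the authors' experimental setup: it assumes a single attention head with random Gaussian queries, keys, and values, no learned projections, and a layer normalization without learnable gain or bias. All function names (`attention_output`, `layer_norm`, `mean_feature_variance`) are hypothetical. It measures the variance across the feature dimension of one attention output vector, with and without normalization, as the sequence length grows.

```python
import math
import random
import statistics


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(seq_len, dim, rng):
    """Output of one toy attention head for a single random query.

    Assumption: queries, keys, and values are i.i.d. standard Gaussian.
    """
    query = [rng.gauss(0, 1) for _ in range(dim)]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    # Scaled dot-product scores between the query and every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    # Attention-weighted average of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature dimension, without gain or bias."""
    mean = statistics.fmean(x)
    var = statistics.pvariance(x, mu=mean)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def mean_feature_variance(seq_len, dim=32, trials=20, normalize=False, seed=0):
    """Average (over trials) of the variance across output features."""
    rng = random.Random(seed)
    variances = []
    for _ in range(trials):
        out = attention_output(seq_len, dim, rng)
        if normalize:
            out = layer_norm(out)
        variances.append(statistics.pvariance(out))
    return statistics.fmean(variances)


if __name__ == "__main__":
    for n in (16, 64, 256):
        raw = mean_feature_variance(n)
        normed = mean_feature_variance(n, normalize=True)
        print(f"seq_len={n:4d}  raw variance={raw:.4f}  after layer norm={normed:.4f}")
```

In this toy setting the raw variance falls as the sequence length grows, while the normalized output stays close to unit variance. This only illustrates the rescaling; it says nothing about the residual distribution shift the abstract mentions.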
