The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

arXiv:2605.06611

Predicted impact: top 31% in LG (last 90 days)

Originality: incremental advance

AI Analysis

For LLM researchers, this paper offers a mechanistic understanding and a practical architectural fix for attention sinks, though the phenomenon is well-known and the fix is incremental.

This work provides a mechanistic explanation for the attention sink phenomenon in LLMs, tracing it to variance discrepancy from value aggregation in self-attention, amplified by super neurons in FFN layers. Controlled interventions replicate sinks at arbitrary positions, and a head-wise RMSNorm modification restores statistical parity, accelerating convergence.

Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
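The variance discrepancy and the normalization idea described in the abstract can be illustrated with a small toy example. This is only a sketch under strong assumptions that do not come from the paper: value vectors are i.i.d. standard Gaussian, attention weights are uniform under a causal mask, and the normalization is a gain-free RMSNorm applied to each head's aggregated output. The function names (`causal_uniform_attention`, `headwise_rmsnorm`, `per_position_variance`) are hypothetical, and the authors' actual head-wise RMSNorm may differ in placement and parameterization.

```python
import numpy as np

def causal_uniform_attention(values):
    """Aggregate values with uniform causal attention (an assumed toy setting).

    values: array of shape (heads, seq, head_dim).
    Position t averages the value vectors at positions 0..t.
    """
    seq = values.shape[1]
    mask = np.tril(np.ones((seq, seq)))
    weights = mask / mask.sum(axis=1, keepdims=True)  # row t holds 1/(t+1) on 0..t
    return np.einsum("ts,hsd->htd", weights, values)

def headwise_rmsnorm(x, eps=1e-6):
    """Rescale each head's output vector to unit RMS over head_dim.

    No learned gain is used here; that is a simplification for illustration.
    """
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms

def per_position_variance(x):
    """Mean squared activation per position, averaged over heads and channels."""
    return np.mean(x**2, axis=(0, 2))

rng = np.random.default_rng(0)
heads, seq, head_dim = 8, 16, 64
values = rng.standard_normal((heads, seq, head_dim))

out = causal_uniform_attention(values)
var_before = per_position_variance(out)
var_after = per_position_variance(headwise_rmsnorm(out))

# The first position sees only itself, so its output keeps the full value
# variance; later positions average more vectors and shrink toward zero.
print("before norm:", np.round(var_before, 3))
# After per-head normalization every position has roughly the same scale.
print("after norm: ", np.round(var_after, 3))
```

In this toy setting the output variance at position t scales roughly as 1/(t+1), so the first token stands out statistically from all later ones. Normalizing per head removes that position-dependent scale, which is one way to read the abstract's "statistical parity across positions".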
