LG | AI | Nov 6, 2025

The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

arXiv:2511.04217v1 | 1 citation | h-index: 9
Originality: Highly original
AI Analysis

This work provides a theoretical foundation for the strong lottery ticket hypothesis (SLTH) in transformers. It addresses a core gap in the understanding of their initialization and pruning; the contribution is incremental but important for the machine learning community.

The paper tackles the lack of theoretical understanding of the strong lottery ticket hypothesis for transformer architectures, specifically multi-head attention (MHA) mechanisms. It proves that a randomly initialized MHA with $H$ heads and input dimension $d$ contains, with high probability, a subnetwork that approximates an arbitrary MHA of the same input dimension, provided the hidden dimension for the key and value is $O(d\log(Hd^{3/2}))$. Experiments show that the approximation error decreases exponentially as the hidden dimension increases.

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA of $H$ heads and input dimension $d$ has the hidden dimension $O(d\log(Hd^{3/2}))$ for the key and value, it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and an approximate target counterpart decreases exponentially by increasing the hidden dimension of the source model.
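
To make the objects in the abstract concrete, the following is a minimal NumPy sketch of what a "subnetwork of a randomly initialized MHA" means: the random weights stay fixed and a binary mask only decides which ones are kept. It is an illustration under stated assumptions, not the paper's construction. The function names (`mha`, `hidden_dim_scale`), the uniform initialization, the `1/sqrt(n)` score scaling, and the randomly drawn masks are all assumptions made here; the paper proves that a good mask exists, and this sketch does not search for one. The helper `hidden_dim_scale` evaluates only the growth rate $d\log(Hd^{3/2})$ with the hidden constant taken as 1.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, Wq, Wk, Wv, Wo):
    """Multi-head attention without biases or normalization layers.

    X: (T, d) input sequence.
    Wq, Wk, Wv: (H, d, n) per-head query/key/value projections.
    Wo: (H, n, d) per-head output projections.
    """
    out = np.zeros_like(X)
    for q, k, v, o in zip(Wq, Wk, Wv, Wo):
        # Score scaling by sqrt(n) is an assumption of this sketch.
        scores = (X @ q) @ (X @ k).T / np.sqrt(q.shape[1])
        out += softmax(scores) @ (X @ v) @ o
    return out

def hidden_dim_scale(d, H):
    # Growth rate d * log(H * d^{3/2}); the constant hidden by O(.) is
    # unknown here and set to 1 (assumption).
    return d * np.log(H * d ** 1.5)

rng = np.random.default_rng(0)
T, d, H, n = 4, 8, 2, 64  # sequence length, input dim, heads, hidden dim

X = rng.standard_normal((T, d))
shapes = [(H, d, n), (H, d, n), (H, d, n), (H, n, d)]

# Source model: random weights that are never trained.
weights = [rng.uniform(-1.0, 1.0, size=s) for s in shapes]

# A subnetwork is the same weights multiplied elementwise by a 0/1 mask.
# These masks are random, for illustration only.
masks = [rng.integers(0, 2, size=s) for s in shapes]
pruned = [w * m for w, m in zip(weights, masks)]

Y_full = mha(X, *weights)
Y_sub = mha(X, *pruned)
print("output shape:", Y_sub.shape)
print("d*log(H*d^1.5) for d=8, H=2:", hidden_dim_scale(d, H))
```

In this setup the hidden dimension `n` of the source model is the quantity the theorem constrains: a larger `n` gives the mask more random weights to choose from, which is the mechanism behind the reported decrease in approximation error.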

