The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
For deep learning researchers, this paper provides a theoretical explanation for the empirical success of GLU in LLMs, though the analysis is limited to two-layer networks in the NTK regime.
The paper finds that GLU structures outperform non-GLU counterparts by reshaping the NTK spectrum, which yields smaller condition numbers and faster convergence; empirically, GLU has only a limited effect on the generalization gap.
Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU does little to reduce the generalization gap across various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than in improving generalization.
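
To make the central quantity concrete, the following is a minimal sketch of how one might compare the empirical NTK condition number of a gated and a non-gated two-layer network at initialization. Everything specific here is an assumption for illustration and is not taken from the paper: the SiLU activation, the 1/sqrt(m) scaling, the Gaussian initialization, the unit-norm random inputs, and the helper names `jacobian_plain`, `jacobian_glu`, and `ntk_condition_number`. The sketch does not assert which model yields the smaller value; that depends on the setup.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def silu_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 + z * (1.0 - s))

def jacobian_plain(X, W, a):
    """Jacobian of f(x) = a . silu(W x) / sqrt(m) w.r.t. (W, a); one row per sample."""
    n, m = X.shape[0], W.shape[0]
    pre = X @ W.T                                          # (n, m)
    dW = (a * silu_grad(pre))[:, :, None] * X[:, None, :]  # (n, m, d)
    da = silu(pre)                                         # (n, m)
    return np.concatenate([dW.reshape(n, -1), da], axis=1) / np.sqrt(m)

def jacobian_glu(X, W, V, a):
    """Jacobian of f(x) = a . (silu(W x) * (V x)) / sqrt(m) w.r.t. (W, V, a)."""
    n, m = X.shape[0], W.shape[0]
    gate, lin = X @ W.T, X @ V.T                           # both (n, m)
    dW = (a * silu_grad(gate) * lin)[:, :, None] * X[:, None, :]
    dV = (a * silu(gate))[:, :, None] * X[:, None, :]
    da = silu(gate) * lin
    return np.concatenate(
        [dW.reshape(n, -1), dV.reshape(n, -1), da], axis=1
    ) / np.sqrt(m)

def ntk_condition_number(J):
    """Condition number of K = J J^T, computed from the singular values of J."""
    s = np.linalg.svd(J, compute_uv=False)  # sorted in descending order
    return (s[0] / s[-1]) ** 2

# Assumed toy setup: n unit-norm inputs in d dimensions, hidden width m.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 512
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

J_plain = jacobian_plain(X, W, a)
J_glu = jacobian_glu(X, W, V, a)
K_plain = J_plain @ J_plain.T   # empirical NTK Gram matrix, shape (n, n)
K_glu = J_glu @ J_glu.T
kappa_plain = ntk_condition_number(J_plain)
kappa_glu = ntk_condition_number(J_glu)
print(f"non-GLU NTK condition number: {kappa_plain:.3e}")
print(f"GLU     NTK condition number: {kappa_glu:.3e}")
```

In the NTK regime, gradient descent on a squared loss reduces the residual along each kernel eigenvector at a rate set by the corresponding eigenvalue, so the ratio of the largest to the smallest eigenvalue bounds how unevenly the different directions converge. That is why a smaller condition number is associated with faster optimization. Averaging over several seeds and widths would be needed before drawing any conclusion from such a toy experiment.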