Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws
This work addresses a theoretical gap for researchers studying optimization in large-scale deep learning, though it may be incremental, since it extends prior self-attention Hessian analyses to full Transformer blocks.
The authors address the lack of theoretical understanding of Transformer optimization landscapes by deriving explicit second-order expressions for the Layer Normalization and feedforward Hessians. This completes the Hessian characterization of full Transformer blocks and informs the analysis of convergence dynamics and scaling laws.
The lack of theoretical results for Layer Normalization and feedforward Hessians has left a gap in the study of Transformer optimization landscapes. We address this by deriving explicit second-order expressions for these components, thereby completing the Hessian characterization of full Transformer blocks. Our results generalize prior self-attention analyses and yield estimates of each sublayer's contribution to curvature propagation. We demonstrate how these Hessian structures inform both convergence dynamics and the empirical scaling laws governing large-model performance. Further, we propose a Taylor-expansion-based framework for analyzing loss differences to quantify convergence trajectories. By extending Hessian theory to the full Transformer architecture, this work provides a foundation for theoretical and empirical investigations of optimization in large-scale deep learning.
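For orientation, the kind of loss-difference analysis mentioned above can be illustrated with the standard second-order Taylor expansion. This is a generic sketch, not necessarily the authors' exact formulation. The symbols are assumptions introduced here for illustration: \theta is the parameter vector, \Delta\theta an update, \mathcal{L} a sufficiently smooth loss, H its Hessian, and \eta a gradient-descent step size.

```latex
% Second-order Taylor expansion of the loss difference
% (assumes \mathcal{L} is three times continuously differentiable near \theta)
\mathcal{L}(\theta + \Delta\theta) - \mathcal{L}(\theta)
  = \nabla\mathcal{L}(\theta)^{\top} \Delta\theta
  + \tfrac{1}{2}\, \Delta\theta^{\top} H(\theta)\, \Delta\theta
  + O\!\left(\lVert \Delta\theta \rVert^{3}\right),
\qquad H(\theta) = \nabla^{2}\mathcal{L}(\theta).

% Special case: one gradient-descent step \Delta\theta = -\eta g,
% with g = \nabla\mathcal{L}(\theta)
\mathcal{L}(\theta - \eta g) - \mathcal{L}(\theta)
  \approx -\eta\, \lVert g \rVert^{2}
  + \tfrac{\eta^{2}}{2}\, g^{\top} H(\theta)\, g .
```

In this simplified setting, when g^{\top} H g > 0 the quadratic model predicts a decrease in loss whenever \eta < 2\lVert g \rVert^{2} / (g^{\top} H g). This is one way in which explicit Hessian expressions for each sublayer could bear on convergence behavior.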