LG ML

Oct 14, 2024

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

Weronika Ormaniec, Felix Dangel, Sidak Pal Singh

ETH Zurich

arXiv:2410.10986v2

16.413 citations

h-index: 11

Has Code

ICLR

Originality: Incremental advance

AI Analysis

The paper offers foundational insights into Transformer optimization for researchers and practitioners, though its contribution to theoretical understanding is incremental.

The paper asks why Transformers require specific optimization techniques such as adaptive optimizers and layer normalization. It answers by theoretically analyzing the Hessian of a single self-attention layer and comparing it to the Hessian of classical networks. The analysis reveals that the Transformer's Hessian depends on the data and weights in a highly non-linear way.

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning, to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures, grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first fully derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so, we further highlight the important structural differences from the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.
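To make the object of study concrete, the following is a minimal JAX sketch of the loss Hessian of a single self-attention layer, split into blocks by parameter matrix. It is an illustration under stated assumptions, not the paper's derivation: it assumes a single head without masking, a mean-squared-error loss, random toy data, and automatic differentiation in place of the closed-form matrix derivatives described in the abstract. The names `self_attention`, `loss`, `W_Q`, `W_K`, `W_V`, and all dimensions are hypothetical choices made for this example.

```python
import jax
import jax.numpy as jnp


def self_attention(params, X):
    """Single-head self-attention: softmax(X W_Q (X W_K)^T / sqrt(d_k)) X W_V."""
    W_Q, W_K, W_V = params
    d_k = W_Q.shape[1]
    scores = (X @ W_Q) @ (X @ W_K).T / jnp.sqrt(d_k)
    A = jax.nn.softmax(scores, axis=-1)  # row-wise attention matrix
    return A @ X @ W_V


def loss(params, X, Y):
    # Assumption: mean-squared-error loss, chosen only for illustration.
    return jnp.mean((self_attention(params, X) - Y) ** 2)


# Toy sizes (assumptions): sequence length, embedding dim, key/query dim.
seq_len, d_model, d_k = 4, 3, 2

key = jax.random.PRNGKey(0)
k_x, k_y, k_q, k_k, k_v = jax.random.split(key, 5)
X = jax.random.normal(k_x, (seq_len, d_model))
Y = jax.random.normal(k_y, (seq_len, d_model))
params = (
    jax.random.normal(k_q, (d_model, d_k)),      # W_Q
    jax.random.normal(k_k, (d_model, d_k)),      # W_K
    jax.random.normal(k_v, (d_model, d_model)),  # W_V
)

# jax.hessian differentiates w.r.t. the first argument (the parameter tuple).
# H[i][j] is the block d^2 loss / d params[i] d params[j], with shape
# params[i].shape + params[j].shape.
H = jax.hessian(loss)(params, X, Y)

names = ("W_Q", "W_K", "W_V")
block_norms = {
    (names[i], names[j]): float(jnp.linalg.norm(H[i][j].ravel()))
    for i in range(3)
    for j in range(3)
}

for (row, col), norm in block_norms.items():
    print(f"||H[{row}, {col}]||_F = {norm:.4f}")
```

Printing the Frobenius norm of each block gives a quick way to see how differently the query, key, and value blocks can behave on the same data. Rescaling `X` or the weights and recomputing the norms is one informal way to probe the data and weight dependencies the abstract refers to.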
