Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance
This work addresses a theoretical gap in explaining why adaptive optimizers help when training large language models, but the contribution is incremental: it builds on existing smoothness assumptions and focuses on a single data property.
The paper asks why adaptive optimization methods such as sign descent outperform gradient descent in language modeling. It analyzes heavy-tailed class imbalance in the data distribution and proves that sign descent converges faster than normalized gradient descent in a minimal next-token prediction setting.
Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages over GD of adaptive algorithms with structured preconditioners (e.g., coordinate-wise or layer-wise) and of steepest descent methods w.r.t. non-Euclidean norms (e.g., the $\ell_\infty$ norm or the spectral norm). However, it remains unclear how these smoothness assumptions manifest in language modeling tasks. In this work, we aim to analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. the $\ell_2$ norm) in the presence of heavy-tailed class imbalance.
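To make the two update rules concrete, the following is a minimal sketch, not the paper's construction. Sign descent moves each coordinate by the sign of its gradient entry, $w \leftarrow w - \eta\,\mathrm{sign}(g)$, while normalized GD moves along the unit-length gradient, $w \leftarrow w - \eta\, g / \|g\|_2$. The toy objective (cross-entropy of a softmax over logits against Zipf-like class frequencies), the helper names, and all hyperparameters are assumptions chosen for illustration only.

```python
import numpy as np


def zipf_frequencies(num_classes, exponent=1.0):
    """Heavy-tailed class frequencies p_k proportional to k^(-exponent) (illustrative assumption)."""
    ranks = np.arange(1, num_classes + 1, dtype=float)
    weights = ranks ** (-exponent)
    return weights / weights.sum()


def loss_and_grad(logits, p):
    """Cross-entropy of softmax(logits) against target frequencies p, with its gradient."""
    z = logits - logits.max()          # shift logits for numerical stability
    q = np.exp(z) / np.exp(z).sum()    # softmax probabilities
    loss = -(p * np.log(q)).sum()
    return loss, q - p                 # gradient w.r.t. logits, valid because p sums to 1


def sign_descent_step(w, g, lr):
    """Steepest descent w.r.t. the l_inf norm: every coordinate moves by lr."""
    return w - lr * np.sign(g)


def normalized_gd_step(w, g, lr):
    """Steepest descent w.r.t. the l_2 norm: move lr along the unit gradient."""
    norm = np.linalg.norm(g)
    return w if norm == 0 else w - lr * g / norm


def run(step_fn, p, lr=0.05, num_steps=200):
    """Apply step_fn for num_steps iterations from zero logits; return the final loss."""
    w = np.zeros_like(p)
    for _ in range(num_steps):
        _, g = loss_and_grad(w, p)
        w = step_fn(w, g, lr)
    return loss_and_grad(w, p)[0]


if __name__ == "__main__":
    p = zipf_frequencies(num_classes=1000)
    print("sign descent loss: ", run(sign_descent_step, p))
    print("normalized GD loss:", run(normalized_gd_step, p))
```

The fixed step size and iteration count are arbitrary. A fair comparison would tune the step size for each method separately, and this toy run says nothing about the rates proved in the paper.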