LG · AI · OC · Apr 19, 2023

A Theory on Adam Instability in Large-Scale Machine Learning

Meta AI
arXiv:2304.09871v2 · 55 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This paper addresses a critical instability problem for researchers and practitioners who train large-scale models, though its contribution is incremental because it builds on known issues with Adam.

The paper tackles the previously unexplained divergence seen in large language model training and attributes it to the Adam optimizer. In the failure state it describes, parameter updates have a relatively large norm and are essentially uncorrelated with the descent direction, which leads to divergence. The effect is more likely with deep models and large batch sizes, and the supporting observations come from training runs of models with up to 546 billion parameters.

We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to be observed in the training of a deep model with a large batch size, which is the typical setting of large-scale language model training. To argue the theory, we present observations from the training runs of the language models of different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.
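To make the two quantities in the abstract concrete (the norm of the parameter update vector and its correlation with the descent direction), here is a minimal NumPy sketch. It implements the standard Adam update rule and measures the update norm and the cosine similarity between the update and the negative gradient. The function names `adam_step` and `update_diagnostics`, the hyperparameter values, and the random test gradient are illustrative assumptions and are not taken from the paper; the paper's own analysis and measurements may differ.

```python
import numpy as np


def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step (hypothetical helper, not the paper's code).

    Returns the parameter update vector and the new moment estimates.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v


def update_diagnostics(update, grad):
    """Update norm and cosine similarity with the descent direction -grad."""
    norm = float(np.linalg.norm(update))
    denom = norm * float(np.linalg.norm(grad))
    cosine = float(np.dot(update, -grad) / denom) if denom > 0 else 0.0
    return norm, cosine


# Illustrative usage on a random gradient (assumed data, not from the paper).
rng = np.random.default_rng(0)
dim = 1000
m, v = np.zeros(dim), np.zeros(dim)
grad = rng.normal(size=dim)
update, m, v = adam_step(grad, m, v, t=1)
norm, cosine = update_diagnostics(update, grad)
print(f"update norm = {norm:.4f}, cosine with -grad = {cosine:.3f}")
```

In this framing, a cosine near zero together with a comparatively large norm would correspond to the state the abstract describes: a large step that carries little information about the descent direction. Logging both values per layer during training is one plausible way to watch for it.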

