LG · OC · Jun 6, 2024

Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

arXiv:2406.04443v3 · 15 citations
Originality: Highly original
AI Analysis

This work addresses a critical issue in training large-scale deep learning models such as LLMs, where heavy-tailed gradient noise is common, by providing theoretical and practical improvements to adaptive optimization methods.

The paper tackles the poor high-probability convergence of AdaGrad and Adam when the stochastic gradient noise is heavy-tailed. It shows that gradient clipping fixes this issue, yielding convergence bounds with polylogarithmic dependence on the confidence level, and supports the theory with empirical evaluations.

Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. We extend our results to the case of AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of the clipped versions of AdaGrad/Adam in handling heavy-tailed noise.
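
To make the idea of clipping combined with a norm-based adaptive stepsize concrete, here is a minimal Python sketch. It assumes the standard AdaGrad-Norm recursion (a single scalar accumulator of squared gradient norms) and the usual norm-clipping operator. The function names, default parameters, the reading of "delayed" as using the accumulator from before the current gradient, and the noisy quadratic test problem are all illustrative assumptions, not taken from the paper, whose exact algorithms and constants may differ.

```python
import math
import random


def clip(g, lam):
    """Rescale vector g so that its Euclidean norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= lam or norm == 0.0:
        return list(g)
    return [v * (lam / norm) for v in g]


def clip_adagrad_norm(grad_fn, x0, steps, gamma=1.0, lam=1.0, b0=1.0, delayed=False):
    """Sketch of clipped AdaGrad-Norm: one scalar stepsize gamma / b_t for all coordinates.

    All parameter names and defaults are illustrative assumptions.
    """
    x = list(x0)
    b_sq = b0 * b0  # running value of b_t^2
    for _ in range(steps):
        g = clip(grad_fn(x), lam)  # clip the stochastic gradient first
        g_sq = sum(v * v for v in g)
        if delayed:
            # Assumed delayed variant: the stepsize uses b from before this gradient.
            scale = gamma / math.sqrt(b_sq)
            b_sq += g_sq
        else:
            b_sq += g_sq
            scale = gamma / math.sqrt(b_sq)
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x


def noisy_quadratic_grad(rng):
    """Gradient of 0.5 * ||x||^2 plus heavy-tailed noise (hypothetical test problem)."""
    def grad(x):
        # Symmetrized Pareto noise with tail index 1.5: finite mean, infinite variance.
        return [xi + rng.choice((-1.0, 1.0)) * (rng.paretovariate(1.5) - 1.0) for xi in x]
    return grad


if __name__ == "__main__":
    rng = random.Random(0)
    x = clip_adagrad_norm(noisy_quadratic_grad(rng), [10.0, -10.0], steps=2000)
    print("final iterate:", x)
```

Because every clipped gradient has norm at most `lam`, a single extreme noise sample can neither move the iterate far nor inflate the accumulator `b_sq` by more than `lam ** 2`, which is the intuition behind pairing clipping with the adaptive stepsize.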

Code Implementations: 1 repo