LG OCMay 17, 2025

On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

arXiv:2505.11840v31 citationsh-index: 14

Originality Incremental advance

AI Analysis

It provides theoretical insights for practitioners using AdamW in deep learning, though it is incremental as it builds on existing convergence analysis.

This paper tackles the theoretical understanding of AdamW's convergence by establishing a convergence rate of O(√d/K^{1/4}) measured by ℓ1 norm, showing it is analogous to SGD's optimal rate in ideal cases, with empirical validation on real-world tasks.

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^KE\left[||\nabla f(x^k)||_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $||\nabla f(x)||_2\ll ||\nabla f(x)||_1\leq \sqrt{d}||\nabla f(x)||_2$ for any high-dimensional vector $x$ and $E\left[||\nabla f(x)||_1\right]\geq\sqrt{\frac{2d}π}E\left[||\nabla f(x)||_2\right]$ when each element of $\nabla f(x)$ is generated from Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $||\nabla f(x)||_1=\varTheta(\sqrt{d})||\nabla f(x)||_2$. Both support that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum_{k=1}^KE\left[||\nabla f(x^k)||_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case. We also extend our result to NAdamW, an AdamW variant that employs a double-momentum mechanism, and demonstrate that it maintains the same convergence rate.

View on arXiv PDF

Similar