LG NE OC · Nov 11, 2024

Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees

arXiv:2411.07120v24.61 citations · h-index: 1 · ICML

Originality: Incremental advance

AI Analysis

This work addresses memory efficiency and training speed for large language models, offering incremental improvements over existing adaptive optimizers such as Adam.

The paper tackles the high memory requirements and slow training of large-scale neural networks by introducing two techniques, Subset-Norm and Subspace-Momentum. Combined, they reduce Adam's optimizer-state memory footprint by more than 80% and reach Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B).

We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian noise, we show a noise-adapted high-probability convergence guarantee with improved dimensional dependence of SN over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove a high-probability convergence result for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80\% with minimal additional hyperparameter tuning.
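The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`subset_norm_step`, `subspace_momentum_step`), the partition into equal contiguous subsets, the exponential-moving-average form of momentum, and the fixed orthonormal projection `P` are all assumptions made for illustration. The abstract does not specify any of them.

```python
import numpy as np

def subset_norm_step(x, grad, acc, lr=0.1, eps=1e-8):
    """One AdaGrad-style update with one shared step size per subset.

    Assumption: the d coordinates are split into m = len(acc) equal,
    contiguous subsets (d must be divisible by m). Each subset keeps a
    single accumulator of squared gradient norms, so the state has m
    entries instead of d.
    """
    m = acc.shape[0]
    g = grad.reshape(m, -1)                  # (m, d // m): one row per subset
    acc = acc + np.sum(g * g, axis=1)        # accumulate squared subset norms
    step = lr / (np.sqrt(acc) + eps)         # one step size per subset
    x_new = x - (step[:, None] * g).reshape(-1)
    return x_new, acc

def subspace_momentum_step(x, grad, mom, P, lr=0.1, beta=0.9):
    """One update with momentum kept only in the column space of P.

    Assumption: P is a (d, r) matrix with orthonormal columns and the
    momentum is an exponential moving average. The momentum state has
    r entries instead of d.
    """
    g_sub = P.T @ grad                       # gradient component in the subspace
    mom = beta * mom + (1.0 - beta) * g_sub  # momentum lives in r dimensions
    g_perp = grad - P @ g_sub                # orthogonal complement: plain SGD
    x_new = x - lr * (P @ mom + g_perp)
    return x_new, mom

# Toy problem (hypothetical): f(x) = 0.5 * ||x||^2, so the gradient is x itself.
d = 16
x = np.ones(d)
acc = np.zeros(4)                            # 4 = sqrt(16) accumulators
for _ in range(50):
    x, acc = subset_norm_step(x, x, acc, lr=0.5)
final_loss = 0.5 * float(x @ x)
```

In this sketch, choosing a single subset (`len(acc) == 1`) gives an AdaGrad-Norm-style update, and choosing one subset per coordinate (`len(acc) == d`) gives a coordinate-wise AdaGrad-style update. With subsets of size about $\sqrt{d}$, the accumulator has about $\sqrt{d}$ entries, which is the $O(\sqrt{d})$ footprint the abstract describes.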

