LG · OC · ML · Nov 15, 2024

MARS: Unleashing the Power of Variance Reduction for Training Large Models

arXiv:2411.10438v4 · 48 citations · h-index: 8 · Has Code · ICML
Originality: Incremental advance
AI Analysis

This work addresses the challenge of scalable and efficient training for deep neural networks and large language models, representing an incremental advancement by adapting existing variance reduction techniques to modern optimizers.

The paper tackles the problem of inefficient training for large models by proposing MARS, a unified optimization framework that integrates variance reduction with preconditioned gradient methods, achieving significant performance improvements over AdamW in training GPT-2 models.

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin. The implementation of MARS is available at https://github.com/AGI-Arena/MARS.
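To make the idea of combining a scaled recursive-momentum correction with an AdamW-style preconditioned update more concrete, here is a minimal sketch in plain Python. It is not the authors' reference implementation (see the repository linked above for that). The function names (`mars_adamw_step`, `noisy_quadratic_grad`, `run_demo`), the toy quadratic objective, the norm clipping threshold, and all hyperparameter defaults (`lr`, `beta1`, `beta2`, `gamma`) are assumptions chosen for illustration and do not come from this listing.

```python
import math
import random


def mars_adamw_step(x, x_prev, grad_fn, sample, state,
                    lr=0.01, beta1=0.9, beta2=0.99, gamma=0.025,
                    eps=1e-8, weight_decay=0.0):
    """One illustrative variance-reduced, Adam-preconditioned update.

    All defaults are assumed values for this sketch, not tuned settings.
    """
    g_now = grad_fn(x, sample)       # gradient at the current iterate
    g_old = grad_fn(x_prev, sample)  # gradient at the previous iterate, SAME sample
    # Scaled recursive-momentum correction: add a scaled gradient difference.
    scale = gamma * beta1 / (1.0 - beta1)
    c = [gn + scale * (gn - go) for gn, go in zip(g_now, g_old)]
    # Clip the corrected gradient to unit norm (assumed safeguard).
    norm = math.sqrt(sum(ci * ci for ci in c))
    if norm > 1.0:
        c = [ci / norm for ci in c]
    state["t"] += 1
    t = state["t"]
    new_x = []
    for i, ci in enumerate(c):
        # Adam-style first and second moment estimates of the corrected gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * ci
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * ci * ci
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Preconditioned step plus decoupled weight decay, as in AdamW.
        step = m_hat / (math.sqrt(v_hat) + eps) + weight_decay * x[i]
        new_x.append(x[i] - lr * step)
    return new_x


def noisy_quadratic_grad(x, sample):
    """Gradient of the toy loss 0.5 * ||x - sample||^2 (hypothetical objective)."""
    return [xi - si for xi, si in zip(x, sample)]


def run_demo(steps=2000, seed=0):
    """Minimize the toy loss with noisy samples drawn around a fixed target."""
    rng = random.Random(seed)
    target = [1.0, -2.0]
    x = [0.0, 0.0]
    x_prev = list(x)  # first step: previous iterate equals the current one
    state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
    for _ in range(steps):
        sample = [tg + rng.gauss(0.0, 0.5) for tg in target]
        x_new = mars_adamw_step(x, x_prev, noisy_quadratic_grad, sample, state)
        x_prev, x = x, x_new
    return x
```

Calling `run_demo()` returns the final iterate, which should end up close to the toy target `[1.0, -2.0]`. The key design point in this sketch is that both gradients in the correction term are evaluated on the same sample, so their difference cancels much of the sampling noise; setting `gamma=0.0` removes the correction and leaves an AdamW-style step applied to the norm-clipped gradient.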

Code Implementations: 2 repos
