LG · Mar 21

Optimal low-rank stochastic gradient estimation for LLM training

arXiv:2603.20632 · 72.1 · h-index: 3
AI Analysis

Provides a principled, optimal low-rank gradient estimation method to reduce memory and improve training for large language models, a critical bottleneck in LLM development.

The paper proposes an unbiased, memory-efficient low-rank stochastic gradient estimator for LLM training that minimizes variance via optimally designed random projections. In RoBERTa-large fine-tuning, it achieves 3.83GB peak GPU memory vs. 16.7GB for full backpropagation, and in LLaMA pretraining (20M-100M parameters), it outperforms traditional methods.

Large language model (LLM) training is often bottlenecked by memory constraints and stochastic gradient noise in extremely high-dimensional parameter spaces. Motivated by empirical evidence that many LLM gradient matrices are effectively low-rank during training, we present an unbiased, memory-efficient, low-rank matrix estimator with the lowest variance that is applicable across common stochastic gradient estimation paradigms. The core idea is to project a high-dimensional stochastic gradient estimator onto a random low-dimensional subspace and lift it back. This reduces memory while keeping the estimator unbiased, and it controls mean-squared error via an optimally designed projection distribution, including Haar-Stiefel projections. The projection distribution is derived by solving a constrained functional optimization problem, yielding an optimal random projector that guides algorithm design. Empirically, the resulting low-rank gradient estimators deliver both practical memory savings and improved training behavior. In RoBERTa-large fine-tuning, our method attains the lowest peak GPU memory among compared methods (e.g., 3.83GB versus 16.7GB for full backpropagation) while remaining competitive in accuracy. In autoregressive LLM pretraining (LLaMA-20M/60M/100M), our method outperforms traditional methods, supporting the benefit of the proposed optimal projection strategy.
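
The project-and-lift idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it does not reproduce their optimal projection distribution. It covers only the plain Haar-Stiefel case, under the assumption that the projector V is an n x r matrix with orthonormal columns drawn uniformly, so that E[V V^T] = (r/n) I and rescaling by n/r makes the lifted estimate unbiased. The function names (`haar_stiefel`, `low_rank_estimate`, `mean_estimate`) are hypothetical and chosen for illustration.

```python
import numpy as np


def haar_stiefel(n, r, rng):
    """Sample an n x r matrix with orthonormal columns (Haar on the Stiefel manifold)."""
    gaussian = rng.standard_normal((n, r))
    q, upper = np.linalg.qr(gaussian)  # reduced QR: q is n x r, upper is r x r
    # Standard sign correction so the QR-based sample is Haar-distributed.
    signs = np.sign(np.diag(upper))
    signs[signs == 0] = 1.0
    return q * signs


def low_rank_estimate(grad, r, rng):
    """Project an m x n gradient onto a random r-dim subspace and lift it back."""
    m, n = grad.shape
    v = haar_stiefel(n, r, rng)
    compressed = grad @ v  # m x r: the only gradient-sized object that must be kept
    # Assumption: scaling by n / r, since E[V V^T] = (r / n) I for Haar-Stiefel V.
    return (n / r) * compressed @ v.T


def mean_estimate(grad, r, trials, seed=0):
    """Average many independent low-rank estimates (Monte Carlo check of unbiasedness)."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(grad)
    for _ in range(trials):
        total += low_rank_estimate(grad, r, rng)
    return total / trials
```

Each single estimate has rank at most r, yet the average over many draws approaches the full gradient, which is the sense in which the estimator trades variance for memory without introducing bias. How the paper chooses the projection distribution to minimize that variance is not shown here.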

