LG Oct 2, 2025

Randomized Gradient Subspaces for Efficient Large Language Model Training

Sahar Rajabi, Nayeema Nonta, Samanvay Vajpayee, Sirisha Rambhatla

arXiv:2510.01878v1 4.1 h-index: 11

Originality: Highly original

AI Analysis

This paper addresses memory efficiency in LLM training, offering a novel method with reported gains in both memory use and pretraining performance.

The paper tackles the memory bottleneck in large language model training by analyzing gradient subspaces and introducing randomized algorithms, achieving state-of-the-art memory savings and improved performance on LLaMA-1B and LLaMA-7B pretraining.

Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of the gradient space and its underlying subspaces. We find that while a small subspace captures most of the gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit this subspace structure and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
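To make the idea of projecting gradients into a low-dimensional subspace concrete, here is a minimal sketch in plain Python. It is not the paper's GrassWalk or GrassJump algorithm. It only illustrates, on a small synthetic gradient matrix, how a random orthonormal basis compresses an m x n gradient to r x n coordinates and how the gradient energy splits between the chosen subspace and the residual. The matrix sizes, the Gaussian "gradient", and the helper names (`random_orthonormal_basis`, `project`, `energy`) are all assumptions made for illustration.

```python
import math
import random


def random_orthonormal_basis(m, r, rng):
    """Return r orthonormal vectors of length m (Gram-Schmidt on Gaussian draws)."""
    basis = []
    while len(basis) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Remove the components of v along the vectors already accepted.
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:  # skip the (very unlikely) degenerate draw
            basis.append([a / norm for a in v])
    return basis


def project(basis, grad):
    """Compress an m x n gradient to its r x n coordinates in the subspace."""
    n = len(grad[0])
    return [
        [sum(q[i] * grad[i][j] for i in range(len(q))) for j in range(n)]
        for q in basis
    ]


def energy(mat):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in mat for x in row)


# Hypothetical toy setup: a synthetic 16 x 8 "gradient" and a rank-4 subspace.
rng = random.Random(0)
m, n, r = 16, 8, 4
grad = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

basis = random_orthonormal_basis(m, r, rng)
low_rank = project(basis, grad)  # an optimizer would keep state at this r x n size

captured = energy(low_rank) / energy(grad)
residual = 1.0 - captured  # share of gradient energy outside the subspace
print(f"captured: {captured:.3f}, residual: {residual:.3f}")
```

Because the basis is orthonormal, the captured and residual fractions sum to one. A subspace built from the leading singular vectors of a real gradient would typically capture more energy than a random one; the random basis is used here only because it needs no decomposition.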
