Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?
For practitioners training large language models, this work offers a simple, low-cost method to mitigate training instabilities caused by heavy-tailed gradient noise.
The paper addresses training instabilities from heavy-tailed stochastic gradient noise in language models. It proposes an entry-wise clipping method that achieves spectral control, saving ~7% of training tokens on NanoGPT pretraining with Adam and ~2% on top of Muon.
Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data and the composition of multiple layers, this noise is heavy-tailed and survives mini-batch averaging. Existing remedies trade off structure against cost: vector-norm clipping ignores the matrix structure of weight updates, while spectral normalization (e.g., Muon (Jordan et al., 2024)) respects it at additional cost. We show that this trade-off can be balanced. Real gradient noise appears to resemble entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish an $O(\varepsilon^{-4})$ convergence guarantee under Cauchy-contaminated noise. Empirically, we find that smooth shrinkage improves Adam on NanoGPT pretraining, saving ${\sim}7\%$ of training tokens. We further find that applying the entry-wise clipping before spectral normalization yields a ${\sim}2\%$ token saving on top of Muon.
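To make the idea of entry-wise control concrete, here is a minimal NumPy sketch under stated assumptions. It is not the paper's estimator: the function names, the threshold `tau`, the `tanh` form of the smooth shrinkage, and the toy rank-1 signal with sparse Cauchy contamination are all hypothetical choices made for illustration only. The sketch applies an entry-wise map to a contaminated gradient matrix and reports the spectral norm of the resulting error.

```python
import numpy as np


def clip_entrywise(grad, tau):
    """Hard entry-wise clipping: every entry is limited to [-tau, tau]."""
    return np.clip(grad, -tau, tau)


def shrink_entrywise(grad, tau):
    """Smooth stand-in for shrinkage (an assumption, not the paper's surrogate).

    tau * tanh(g / tau) is close to the identity for |g| << tau and
    saturates at +/- tau for large |g|.
    """
    return tau * np.tanh(grad / tau)


def demo(seed=0, m=64, n=32, tau=3.0, contamination=0.02):
    """Compare spectral-norm errors on a toy contaminated gradient."""
    rng = np.random.default_rng(seed)
    # Hypothetical "true" gradient: a rank-1 matrix.
    signal = rng.standard_normal((m, 1)) @ rng.standard_normal((1, n))
    # Sparse entry-wise Cauchy contamination on a small fraction of entries.
    mask = rng.random((m, n)) < contamination
    noisy = signal + mask * rng.standard_cauchy((m, n))

    def spec(a):
        # Spectral norm = largest singular value.
        return np.linalg.norm(a, 2)

    return {
        "raw": spec(noisy - signal),
        "clip": spec(clip_entrywise(noisy, tau) - signal),
        "shrink": spec(shrink_entrywise(noisy, tau) - signal),
    }


if __name__ == "__main__":
    for name, err in demo().items():
        print(f"{name:>6}: spectral error = {err:.3f}")
```

One fact the sketch relies on is elementary: if every entry of an $m \times n$ matrix is bounded by $\tau$ in absolute value, its spectral norm is at most $\tau\sqrt{mn}$, whereas a single unclipped outlier entry already lower-bounds the spectral norm by its magnitude. The sharper localization argument in the abstract is not reproduced here.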