Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
For deep learning practitioners, spectral clipping offers a principled way to handle heavy-tailed noise in matrix parameters, improving upon vector-based clipping with minimal overhead.
The paper introduces spectral clipping, a gradient clipping method that clamps only the leading singular values of matrix-valued gradients, stabilizing training under heavy-tailed noise. It achieves optimal convergence rates and shows competitive performance on synthetic and neural network tasks.
Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal $\mathcal{O}\left(K^{\frac{2 - 2\alpha}{3\alpha - 2}}\right)$ rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top $r$ singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
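To make the clamping operation concrete, the following is a minimal NumPy sketch of the idea described above, not the paper's implementation. The function names `spectral_clip` and `spectral_clip_top_r` and the parameter names `tau` and `r` are hypothetical. For simplicity the sketch computes an exact SVD, whereas the abstract describes a randomized truncated SVD for large layers; the top-$r$ variant here only illustrates the low-rank correction that such a truncated decomposition would feed.

```python
import numpy as np


def spectral_clip(grad, tau):
    """Clamp every singular value of a 2-D gradient at tau.

    The singular vectors (directions) are left unchanged; only the
    singular values larger than tau are reduced to tau.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    s_clipped = np.minimum(s, tau)
    # U * s_clipped scales column i of U by s_clipped[i].
    return (U * s_clipped) @ Vt


def spectral_clip_top_r(grad, tau, r):
    """Clamp only the r leading singular values at tau.

    Subtracts the low-rank excess U_r diag(max(s_r - tau, 0)) V_r^T
    from the gradient. Assumption: an exact SVD stands in for the
    randomized truncated SVD mentioned in the abstract.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    excess = np.maximum(s[:r] - tau, 0.0)
    return grad - (U[:, :r] * excess) @ Vt[:r, :]


# Illustrative use: small noise plus one large rank-one "outlier" component.
rng = np.random.default_rng(0)
G = 0.1 * rng.standard_normal((8, 5))
G += 50.0 * np.outer(rng.standard_normal(8), rng.standard_normal(5))

G_clipped = spectral_clip(G, tau=1.0)
print(np.linalg.svd(G, compute_uv=False))          # spectrum before clipping
print(np.linalg.svd(G_clipped, compute_uv=False))  # spectrum after clipping
```

When every singular value beyond the first $r$ is already below the threshold, the two functions return the same matrix, which is the situation the abstract's empirical observation (outliers inflate only a few leading singular values) suggests is common.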