LG | OC | Oct 17, 2024

Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

arXiv:2410.13954v2 | 2 citations | h-index: 56
Originality: Highly original
AI Analysis

This work addresses the robustness of machine learning optimization under heavy-tailed noise. It offers a general framework that subsumes existing methods such as clipping and provides improved theoretical guarantees, though it is an incremental extension of existing nonlinear SGD analysis.

The paper studies high-probability convergence in online learning under heavy-tailed noise and proposes a unified framework for nonlinear stochastic gradient descent methods. It establishes rates of $\widetilde{\mathcal{O}}(t^{-1/4})$ for non-convex costs and $\mathcal{O}(t^{-\zeta})$, $\zeta \in (0,1)$, for strongly convex costs. State-of-the-art exponents depend on the noise moment order $p$ and vanish as $p \rightarrow 1$, whereas these exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs.

We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work, the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art works depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
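To make the black-box treatment of the nonlinearity concrete, here is a minimal sketch of a nonlinear SGD loop in which the nonlinearity is simply a function argument. It is not the authors' implementation. The quadratic cost, the Cauchy noise (symmetric, with no finite mean), the step-size schedule `step0 / t**decay`, the clipping threshold `m`, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def sign_nl(g):
    """Component-wise sign nonlinearity."""
    return np.sign(g)


def componentwise_clip(g, m=1.0):
    """Clip each coordinate to [-m, m] (m is an illustrative threshold)."""
    return np.clip(g, -m, m)


def joint_clip(g, m=1.0):
    """Rescale the whole vector so that its Euclidean norm is at most m."""
    norm = np.linalg.norm(g)
    return g if norm <= m else g * (m / norm)


def nonlinear_sgd(grad, x0, nonlinearity, steps, rng, step0=1.0, decay=0.75):
    """Run x_{t+1} = x_t - a_t * nonlinearity(grad(x_t) + noise_t).

    The nonlinearity is treated as a black box. The noise model and the
    schedule a_t = step0 / t**decay are assumptions of this sketch.
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(1, steps + 1):
        # Heavy-tailed, symmetric noise: standard Cauchy has no finite mean.
        noise = rng.standard_cauchy(size=x.shape)
        x = x - (step0 / t**decay) * nonlinearity(grad(x) + noise)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = lambda x: x  # gradient of the toy cost f(x) = 0.5 * ||x||^2
    for nl in (sign_nl, componentwise_clip, joint_clip):
        x_final = nonlinear_sgd(grad, [5.0, -3.0, 2.0], nl, steps=5000, rng=rng)
        print(nl.__name__, np.linalg.norm(x_final))
```

Swapping the `nonlinearity` argument is the only change needed to move between sign, component-wise clipping and joint clipping, which mirrors the unified view described above. Which choice performs best depends on the noise and the problem, so the toy run is not evidence for any particular ranking.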

