ML LG OCJul 12, 2025

Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization

Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

arXiv:2507.09093v14 citationsh-index: 22

Originality Highly original

AI Analysis

This addresses convergence issues in optimization for machine learning under realistic heavy-tailed noise conditions, representing a theoretical advancement with some incremental improvements over prior work.

The paper tackles the problem of stochastic gradient descent (SGD) convergence under heavy-tailed noise in non-convex optimization, showing that nonlinear SGD achieves rate Õ(t^{-1/2}) with exponentially decaying tails for symmetric noise, and proposes symmetrized estimators (SGE and MSGE) to handle non-symmetric noise with similar guarantees.

We study convergence in high-probability of SGD-type methods in non-convex optimization and the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$, for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators, based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we get N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and PDF satisfying a mild technical condition, with N-MSGE additionally requiring bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.

View on arXiv PDF

Similar