On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes
This paper develops a theoretical account of how perturbations affect SGD in composite optimization, which is relevant to deep learning practitioners who encounter gradient noise and gradient spikes.
The paper analyzes stochastic gradient descent (SGD) with perturbations in both the forward and backward passes for composite optimization, showing that these perturbations cascade and compound geometrically with the number of operators. It provides the first comprehensive theoretical analysis of this setting, derives convergence guarantees for general non-convex and Polyak–Łojasiewicz objectives, and explains the gradient spiking observed in deep learning. Experiments on logistic regression support the theory.
We study stochastic gradient descent (SGD) for composite optimization problems with $N$ sequential operators subject to perturbations in both the forward and backward passes. Unlike classical analyses that treat gradient noise as additive and localized, perturbations to intermediate outputs and gradients cascade through the computational graph, compounding geometrically with the number of operators. We present the first comprehensive theoretical analysis of this setting. Specifically, we characterize how forward and backward perturbations propagate and amplify within a single gradient step, derive convergence guarantees for both general non-convex objectives and functions satisfying the Polyak--Łojasiewicz condition, and identify conditions under which perturbations do not deteriorate the asymptotic convergence order. As a byproduct, our analysis furnishes a theoretical explanation for the gradient spiking phenomenon widely observed in deep learning, precisely characterizing the conditions under which training recovers from spikes or diverges. Experiments on logistic regression with convex and non-convex regularization validate our theories, illustrating the predicted spike behavior and the asymmetric sensitivity to forward versus backward perturbations.
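To make the cascading effect concrete, the following is a minimal toy sketch and not the construction analyzed in the paper. It assumes $N$ scalar linear operators $f_i(x) = w x$ with a common Lipschitz constant $w$, and an identical additive perturbation `delta` on every intermediate output. The names `forward` and `output_error`, and all parameter values, are hypothetical choices made for illustration. Under these assumptions the output error is $\delta \sum_{i=1}^{N} w^{N-i} = \delta\,(w^N - 1)/(w - 1)$, which grows geometrically in $N$ when $w > 1$.

```python
import numpy as np


def forward(x, weights, delta=0.0):
    """Apply scalar operators f_i(x) = w_i * x in sequence.

    `delta` is an additive perturbation applied to every intermediate
    output (an illustrative assumption, not the paper's noise model).
    """
    for w in weights:
        x = w * x + delta
    return x


def output_error(n_ops, lipschitz=1.5, delta=1e-3, x0=1.0):
    """Absolute gap between the perturbed and the clean forward pass."""
    weights = np.full(n_ops, lipschitz)
    clean = forward(x0, weights)
    perturbed = forward(x0, weights, delta)
    return abs(perturbed - clean)


if __name__ == "__main__":
    # The error compounds with depth even though each perturbation is small.
    for n in (1, 5, 10, 20):
        closed_form = 1e-3 * (1.5**n - 1) / (1.5 - 1)
        print(n, output_error(n), closed_form)
```

With `lipschitz` below 1 the same sum stays bounded by `delta / (1 - lipschitz)`, so in this toy model the amplification depends on the operators being expansive.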