Aiming towards the minimizers: fast convergence of SGD for overparametrized problems
This addresses the convergence speed problem for practitioners using SGD in overparametrized settings like deep learning, offering a theoretical improvement over existing guarantees that require small steps and result in slower rates.
The paper tackles the slow convergence of stochastic gradient descent (SGD) in overparametrized problems by proposing a regularity condition that allows SGD to achieve the same worst-case iteration complexity as deterministic gradient descent, using only single or minibatch gradients per iteration, and demonstrates this condition holds for wide feedforward neural networks with linear output layers.
Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.