Last iterate convergence of SGD for Least-Squares in the Interpolation regime
This work provides theoretical guarantees for the last iterate convergence of SGD in over-parameterized least-squares, which is relevant for researchers studying the optimization and generalization properties of machine learning models, particularly those that interpolate their training data.
This paper investigates the last iterate convergence of Stochastic Gradient Descent (SGD) with a constant step-size for least-squares problems in the interpolation regime, where the optimal predictor perfectly fits the data. The authors demonstrate explicit convergence for non-strongly convex problems and provide non-asymptotic polynomial convergence rates that can be faster than O(1/T) in the over-parameterized setting.
Motivated by the recent successes of neural networks that can fit the data perfectly and still generalize well, we study the noiseless model in the fundamental least-squares setup. We assume that an optimal predictor perfectly fits inputs and outputs, $\langle θ_* , φ(X) \rangle = Y$, where $φ(X)$ stands for a possibly infinite-dimensional non-linear feature map. To solve this problem, we consider the estimator given by the last iterate of stochastic gradient descent (SGD) with constant step-size. In this context, our contribution is twofold: (i) from a (stochastic) optimization perspective, we exhibit an archetypal problem where we can show explicitly the convergence of the final SGD iterate for a non-strongly convex problem with constant step-size, whereas usual results use some form of averaging, and (ii) from a statistical perspective, we give explicit non-asymptotic convergence rates in the over-parameterized setting and leverage a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$. The link with reproducing kernel Hilbert spaces is established.
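As a rough illustration of the setting, the sketch below runs constant step-size SGD on a synthetic noiseless least-squares problem and tracks the excess risk of the last iterate, with no averaging. It uses the standard recursion $θ_t = θ_{t-1} - γ (\langle θ_{t-1}, φ(X_t) \rangle - Y_t) φ(X_t)$ with a fresh sample at each step. Everything specific here is an assumption made for the example and is not taken from the paper: the finite dimension `d`, the feature distribution (random signs scaled by `sqrt(eigs)`, so that the covariance is `diag(eigs)` and $\|φ(X)\|^2 = R^2$ exactly), the polynomially decaying spectrum `eigs`, the choice `theta_star = 1`, and the step-size $γ = 1/(2R^2)$.

```python
import numpy as np


def excess_risk(theta, theta_star, eigs):
    # Excess risk 0.5 * E[(<theta, x> - y)^2] when E[x x^T] = diag(eigs)
    # and y = <theta_star, x> (noiseless model).
    diff = theta - theta_star
    return 0.5 * np.sum(eigs * diff ** 2)


def sgd_last_iterate(theta_star, eigs, step_size, n_steps, seed=0):
    # Single-pass SGD with constant step-size; returns the LAST iterate
    # (no averaging) and the excess risk after every step.
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(theta_star)
    risks = [excess_risk(theta, theta_star, eigs)]
    for _ in range(n_steps):
        # Assumed feature model: random signs scaled by sqrt(eigs),
        # so E[x x^T] = diag(eigs) and ||x||^2 = sum(eigs) for every sample.
        signs = rng.choice([-1.0, 1.0], size=eigs.shape)
        x = np.sqrt(eigs) * signs
        y = x @ theta_star  # noiseless label: theta_star interpolates
        theta = theta - step_size * (x @ theta - y) * x
        risks.append(excess_risk(theta, theta_star, eigs))
    return theta, np.array(risks)


d = 50
eigs = 1.0 / np.arange(1, d + 1) ** 2   # assumed polynomial eigenvalue decay
theta_star = np.ones(d)                 # assumed optimum
R2 = eigs.sum()                         # ||x||^2 for every sample
step_size = 1.0 / (2.0 * R2)            # assumed constant step-size

theta_T, risks = sgd_last_iterate(theta_star, eigs, step_size, n_steps=5000)
print("initial excess risk:", risks[0])
print("final excess risk:  ", risks[-1])
```

Because the labels are noiseless, each update multiplies $θ_{t-1} - θ_*$ by $I - γ φ(X_t) φ(X_t)^\top$, which is a contraction whenever $γ \|φ(X_t)\|^2 \le 2$, so the distance to $θ_*$ never increases along the run. Plotting `risks` against the step index on a log-log scale gives a quick way to eyeball the decay rate for different spectra.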