When Does Stochastic Gradient Algorithm Work Well?
This work helps machine learning researchers and practitioners predict when SGD will behave efficiently. The contribution is incremental, since it extends existing SGD analysis with a new assumption.
The paper studies when stochastic gradient descent (SGD) with a fixed, large step size works efficiently. It proposes a novel assumption on the objective function that yields improved convergence rates to a neighborhood of the optimal solutions, and it validates this assumption empirically on logistic regression and deep neural networks trained on classical datasets.
In this paper, we consider a general stochastic optimization problem that is often at the core of supervised learning, for example in deep learning and linear classification. We consider the standard stochastic gradient descent (SGD) method with a fixed, large step size and propose a novel assumption on the objective function under which this method achieves improved convergence rates (to a neighborhood of the optimal solutions). We then empirically demonstrate that this assumption holds for logistic regression and standard deep neural networks on classical data sets. Our analysis thus helps to explain when efficient behavior can be expected from SGD in training classification models and deep neural networks.
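As a rough illustration of the setting (not the experiments reported in the paper), the following minimal sketch runs plain SGD with a constant step size on logistic regression. The synthetic data, the step size of 0.5, the iteration count, and the helper names `make_data`, `sgd_fixed_step`, and `logistic_loss` are all assumptions made for this example.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Average logistic loss for labels y in {-1, +1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margins)))


def sgd_fixed_step(X, y, step_size=0.5, n_iters=2000, seed=0):
    """Plain SGD with a constant step size, one random sample per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)
        margin = y[i] * (X[i] @ w)
        # 1 / (1 + exp(margin)), written with tanh to avoid overflow.
        weight = 0.5 * (1.0 - np.tanh(0.5 * margin))
        # Stochastic gradient of log(1 + exp(-margin)) with respect to w.
        grad = -y[i] * weight * X[i]
        w -= step_size * grad  # the step size is never decayed
    return w


def make_data(n=500, d=5, seed=1):
    """Hypothetical synthetic binary classification data with noisy labels."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    scores = X @ w_true + 0.5 * rng.normal(size=n)
    y = np.where(scores > 0, 1.0, -1.0)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    w = sgd_fixed_step(X, y)
    print("loss at w = 0:  ", logistic_loss(np.zeros(X.shape[1]), X, y))
    print("loss after SGD: ", logistic_loss(w, X, y))
```

Because the step size stays fixed, the iterates are not expected to settle on an exact minimizer. They typically keep fluctuating within a region around one, which is the "neighborhood of the optimal solutions" that the convergence statement refers to.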