ML · LG · NA · OC · Dec 5, 2013

Semi-Stochastic Gradient Descent Methods

arXiv:1312.1666v2 · 245 citations
Originality: Incremental advance
AI Analysis

This work addresses the computational cost of optimization for large-scale machine learning problems. It offers a method that includes SVRG as a special case and needs fewer stochastic gradient evaluations, though the advance is incremental in nature.

The paper tackles the problem of minimizing the average of many smooth convex loss functions. It proposes S2GD, a semi-stochastic gradient descent method whose expected workload to output an ε-accurate solution is O((κ/n) log(1/ε)) passes over the data. For a specific example with n = 10^9 and κ = 10^3, reaching 10^-6 accuracy takes a workload equivalent to about 2.1 full gradient evaluations.

In this paper we study the problem of minimizing the average of a large number ($n$) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs, in each of which a single full gradient and a random number of stochastic gradients are computed, following a geometric law. The total work needed for the method to output an $\varepsilon$-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the loss, is $O((\kappa/n)\log(1/\varepsilon))$, where $\kappa$ is the condition number. This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs, with a single full gradient evaluation and $O(\kappa)$ stochastic gradient evaluations in each. The SVRG method of Johnson and Zhang arises as a special case. If our method is limited to a single epoch only, it needs to evaluate at most $O((\kappa/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In contrast, SVRG requires $O(\kappa/\varepsilon^2)$ stochastic gradients. To illustrate our theoretical results, S2GD only needs the workload equivalent to about 2.1 full gradient evaluations to find a $10^{-6}$-accurate solution for a problem with $n=10^9$ and $\kappa=10^3$.
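
The epoch structure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function name `s2gd`, its parameters, the toy least-squares problem, and the chosen step size are all hypothetical. The exact form of the geometric law (weights proportional to `(1 - nu*step)**(max_inner - t)`, with `nu` a lower bound on the strong convexity constant) is an assumption about what the abstract means, and counting two single-loss gradients per inner step is one possible accounting convention.

```python
import random


def s2gd(grad_i, n, x0, step, max_inner, epochs, nu=0.0, seed=0):
    """Minimal S2GD sketch for minimizing (1/n) * sum_i f_i(x).

    grad_i(i, x) must return the gradient of f_i at x as a list of floats.
    Returns (final iterate, work counted in single-loss gradient evaluations).
    """
    rng = random.Random(seed)
    x = list(x0)
    work = 0
    # Assumed form of the geometric law for the inner-loop length t:
    # P(t) is proportional to (1 - nu*step)**(max_inner - t), t = 1..max_inner.
    # With nu = 0 this reduces to a uniform draw.
    lengths = list(range(1, max_inner + 1))
    weights = [(1.0 - nu * step) ** (max_inner - t) for t in lengths]

    for _ in range(epochs):
        # One full gradient at the epoch's reference point x (one pass over data).
        full = [0.0] * len(x)
        for i in range(n):
            full = [f + g / n for f, g in zip(full, grad_i(i, x))]
        work += n

        # Random number of cheap, variance-reduced stochastic steps.
        t = rng.choices(lengths, weights=weights)[0]
        y = list(x)
        for _ in range(t):
            i = rng.randrange(n)
            gy, gx = grad_i(i, y), grad_i(i, x)
            y = [yk - step * (fk + a - b)
                 for yk, fk, a, b in zip(y, full, gy, gx)]
        work += 2 * t  # two single-loss gradients per inner step
        x = y
    return x, work


# Hypothetical toy problem: noiseless least squares, f_i(x) = 0.5*(a_i.x - b_i)^2.
data_rng = random.Random(1)
x_true = [1.0, -2.0]
n = 50
A = [[data_rng.uniform(-1.0, 1.0) for _ in x_true] for _ in range(n)]
b = [sum(a * xt for a, xt in zip(row, x_true)) for row in A]


def grad_i(i, x):
    residual = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [residual * a for a in A[i]]


x_hat, work = s2gd(grad_i, n, x0=[0.0, 0.0], step=0.05,
                   max_inner=400, epochs=40)
print("estimate:", x_hat)
print("passes over data:", work / n)
```

Dividing `work` by `n` expresses the cost in passes over the data, the unit the abstract uses. On this toy problem `n` is tiny, so the stochastic steps dominate the cost; in the regime the abstract highlights, where `n` is far larger than the condition number, the full gradient in each epoch dominates instead.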

