MLLGFeb 22, 2019

Beating SGD Saturation with Tail-Averaging and Minibatching

arXiv:1902.08668v240 citations
AI Analysis

This work provides practical insights for machine learning practitioners by optimizing SGD variants to achieve optimal learning errors in nonparametric settings, though it is incremental as it builds on existing SGD methods.

The paper tackles the problem of understanding and optimizing stochastic gradient descent (SGD) variants for nonparametric least squares learning, showing that tail-averaging enables faster convergence rates than uniform averaging and that combining it with minibatching allows more aggressive step-size choices.

While stochastic gradient descent (SGD) is one of the major workhorses in machine learning, the learning properties of many practically used variants are poorly understood. In this paper, we consider least squares learning in a nonparametric setting and contribute to filling this gap by focusing on the effect and interplay of multiple passes, mini-batching and averaging, and in particular tail averaging. Our results show how these different variants of SGD can be combined to achieve optimal learning errors, hence providing practical insights. In particular, we show for the first time in the literature that tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. Finally, we show that a combination of tail-averaging and minibatching allows more aggressive step-size choices than using any one of said components.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes