Variance-Reduced Methods for Machine Learning
This is an incremental review aimed at non-expert readers, summarizing developments in VR methods to improve optimization efficiency for machine learning tasks.
The paper reviews variance reduction (VR) methods for stochastic optimization in machine learning, which have emerged over the last 8 years to achieve faster convergence than traditional stochastic gradient descent (SGD) when multiple passes through data are allowed.
Stochastic optimization lies at the heart of machine learning, and its cornerstone is stochastic gradient descent (SGD), a method introduced over 60 years ago. The last 8 years have seen an exciting new development: variance reduction (VR) for stochastic optimization methods. These VR methods excel in settings where more than one pass through the training data is allowed, achieving a faster convergence than SGD in theory as well as practice. These speedups underline the surge of interest in VR methods and the fast-growing body of work on this topic. This review covers the key principles and main developments behind VR methods for optimization with finite data sets and is aimed at non-expert readers. We focus mainly on the convex setting, and leave pointers to readers interested in extensions for minimizing non-convex functions.