Variance Reduction for Distributed Stochastic Gradient Descent
This work addresses a bottleneck for practitioners in distributed machine learning: it enables scalable, stable variance reduction without full gradient computations or extra storage. The contribution is incremental, since it builds on existing variance reduction (VR) methods.
The paper tackles the high memory usage and exact gradient computations required by variance reduction methods for stochastic gradient descent. It proposes VR-lite, which eliminates both requirements and compares favorably with state-of-the-art methods in empirical evaluations.
Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance-reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical distributed settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants that are scalable and remain stable with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms compare favorably to other stochastic methods.
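To make the cost described above concrete, the following is a minimal sketch of an SVRG-style update, one of the existing VR schemes that recomputes an exact gradient over the entire dataset once per epoch. It is not VR-lite, whose update rule is not given in this section. The one-dimensional least-squares objective, the function names (`full_gradient`, `sample_gradient`, `svrg`), and the stepsize and epoch values are all assumptions chosen for illustration.

```python
import random


def sample_gradient(w, x, y):
    """Gradient of the single-sample loss 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def full_gradient(w, xs, ys):
    """Exact gradient averaged over the whole dataset (one full pass over the data)."""
    return sum(sample_gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)


def svrg(xs, ys, stepsize=0.5, epochs=30, seed=0):
    """SVRG-style loop (illustrative sketch). Returns (w, number of full passes)."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    full_passes = 0
    for _ in range(epochs):
        # Snapshot the iterate and pay for one exact gradient over all n samples.
        # This per-epoch full pass is the cost that limits distributed use.
        w_snapshot = w
        mu = full_gradient(w_snapshot, xs, ys)
        full_passes += 1
        for _ in range(n):
            i = rng.randrange(n)
            # Variance-reduced estimate: stochastic gradient at w, corrected by
            # the same sample's gradient at the snapshot plus the exact mean.
            g = (sample_gradient(w, xs[i], ys[i])
                 - sample_gradient(w_snapshot, xs[i], ys[i])
                 + mu)
            w -= stepsize * g
    return w, full_passes


if __name__ == "__main__":
    xs = [0.9, 1.0, 1.1, 1.0]
    ys = [2.0 * x for x in xs]  # noise-free data, so the minimizer is w = 2
    w, passes = svrg(xs, ys)
    print(w, passes)
```

In this sketch the stepsize stays constant across epochs, which is the benefit of variance reduction named above, while `full_passes` grows by one per epoch, which is the overhead a method without full gradient computations would avoid.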