MLLGSep 29, 2017

Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling

arXiv:1709.10432v1119 citations
Originality Incremental advance
AI Analysis

This work addresses the convergence guarantees for a common data processing practice in large-scale machine learning, providing theoretical insights for practitioners to optimize data handling methods.

The paper analyzes the convergence properties of distributed stochastic gradient descent (SGD) with data shuffling, proving that global shuffling ensures convergence for both convex and non-convex tasks, while local shuffling has slower rates, and it identifies conditions where insufficient shuffling does not affect convergence.

When using stochastic gradient descent to solve large-scale machine learning problems, a common practice of data processing is to shuffle the training data, partition the data across multiple machines if needed, and then perform several epochs of training on the re-shuffled (either locally or globally) data. The above procedure makes the instances used to compute the gradients no longer independently sampled from the training data set. Then does the distributed SGD method have desirable convergence properties in this practical situation? In this paper, we give answers to this question. First, we give a mathematical formulation for the practical data processing procedure in distributed machine learning, which we call data partition with global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. We prove that SGD with global shuffling has convergence guarantee in both convex and non-convex cases. An interesting finding is that, the non-convex tasks like deep learning are more suitable to apply shuffling comparing to the convex tasks. Second, we conduct the convergence analysis for SGD with local shuffling. The convergence rate for local shuffling is slower than that for global shuffling, since it will lose some information if there's no communication between partitioned data. Finally, we consider the situation when the permutation after shuffling is not uniformly distributed (insufficient shuffling), and discuss the condition under which this insufficiency will not influence the convergence rate. Our theoretical results provide important insights to large-scale machine learning, especially in the selection of data processing methods in order to achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes