Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD
This work addresses communication bottlenecks in large-scale distributed machine learning, but the contribution is incremental: it extends known sparsification results to an asynchronous, non-convex setting.
The paper studies distributed asynchronous SGD in the presence of staleness, where sparsification is applied to reduce communication overhead. It proves theoretically, and confirms empirically, that sparsification does not harm convergence: the ergodic rate matches the standard SGD rate of O(1/√T).
Large-scale machine learning increasingly relies on distributed optimization, whereby several machines contribute to the training process of a statistical model. In this work we study the performance of asynchronous, distributed settings when applying sparsification, a technique used to reduce communication overheads. In particular, for the first time in an asynchronous, non-convex setting, we theoretically prove that, in the presence of staleness, sparsification does not harm SGD performance: the ergodic convergence rate matches the known result of standard SGD, that is, $\mathcal{O} \left( 1/\sqrt{T} \right)$. We also carry out an empirical study to complement our theory, and confirm that the effects of sparsification on the convergence rate are negligible when compared to 'vanilla' SGD, even in the challenging scenario of an asynchronous, distributed system.
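
To make the setting concrete, the following is a minimal sketch of sparsified SGD with stale gradients, simulated on a single machine. The abstract does not say which sparsification operator or staleness model the paper uses, so everything here is an assumption made for illustration: top-k magnitude sparsification, a uniformly random delay bounded by `max_delay`, and the hypothetical function names `top_k_sparsify` and `stale_sparse_sgd`.

```python
import numpy as np


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest.

    Assumption: top-k is used here as one common sparsifier; the paper's
    operator may differ.
    """
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    # Indices of the k entries with the largest absolute value.
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(grad.shape)


def stale_sparse_sgd(grad_fn, x0, lr, k, max_delay, steps, seed=0):
    """Simulate asynchronous SGD: each update uses a gradient computed at an
    older iterate (staleness), sparsified before it is applied.

    Assumption: the delay is drawn uniformly from {0, ..., max_delay}.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    history = [x.copy()]  # history[t] is the iterate after t updates
    for t in range(steps):
        # A worker read the model `delay` updates ago (never before step 0).
        delay = rng.integers(0, min(max_delay, t) + 1)
        stale_x = history[t - delay]
        g = top_k_sparsify(grad_fn(stale_x), k)
        x = x - lr * g
        history.append(x.copy())
    return x


# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x0 = np.arange(1, 11, dtype=float)
x_final = stale_sparse_sgd(lambda x: x, x0, lr=0.1, k=3, max_delay=2, steps=500)
```

In a real deployment only the `k` retained values and their indices would be sent to the parameter server, which is where the communication saving comes from. The toy quadratic above is convex and only checks that the loop runs and makes progress; it says nothing about the non-convex rate claimed in the abstract.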