ML · LG · Jan 15, 2016

Faster Asynchronous SGD

arXiv:1601.04033v1 · 22 citations
Originality: Incremental advance
AI Analysis

This paper addresses convergence problems that practitioners face in asynchronous distributed machine learning. The advance is incremental, since it builds on existing staleness-mitigation approaches.

The paper tackles stale gradients in asynchronous distributed stochastic gradient descent, which hinder convergence. It proposes a method that quantifies staleness using moving averages of gradient statistics rather than the number of elapsed updates. The authors report improved convergence speed and scalability to many clients, and an extension that reduces total bandwidth usage by a factor of 5 with little impact on the convergence of the cost.

Asynchronous distributed stochastic gradient descent methods have trouble converging because of stale gradients. A gradient update sent to a parameter server by a client is stale if the parameters used to calculate that gradient have since been updated on the server. Approaches have been proposed to circumvent this problem that quantify staleness in terms of the number of elapsed updates. In this work, we propose a novel method that quantifies staleness in terms of moving averages of gradient statistics. We show that this method outperforms previous methods with respect to convergence speed and scalability to many clients. We also discuss how an extension to this method can be used to dramatically reduce bandwidth costs in a distributed training context. In particular, our method allows reduction of total bandwidth usage by a factor of 5 with little impact on cost convergence. We also describe (and link to) a software library that we have used to simulate these algorithms deterministically on a single machine.
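To make the definition of staleness concrete, here is a minimal single-machine sketch of the baseline notion the abstract mentions: staleness counted as the number of server updates that have elapsed since a client pulled its parameters. It is not the paper's moving-average method, and it does not use the paper's software library. The `ParameterServer` class, the quadratic cost, and the `lr / (1 + staleness)` damping rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParameterServer:
    """Toy parameter server holding one scalar parameter (hypothetical)."""
    w: float = 0.0
    version: int = 0  # number of updates applied so far

    def pull(self):
        """Return the current parameter and the version it corresponds to."""
        return self.w, self.version

    def push(self, grad, grad_version, lr=0.1):
        """Apply a gradient computed at `grad_version`; return its staleness."""
        # Staleness = server updates applied since the client pulled.
        staleness = self.version - grad_version
        # Assumed damping rule: shrink the step as staleness grows.
        self.w -= lr / (1 + staleness) * grad
        self.version += 1
        return staleness


def grad(w, target=3.0):
    """Gradient of the illustrative cost 0.5 * (w - target) ** 2."""
    return w - target


def simulate(num_clients=4, rounds=50):
    """Deterministic simulation: clients pull together, then push in turn."""
    server = ParameterServer()
    staleness_log = []
    for _ in range(rounds):
        # Every client sees the same snapshot, so each later push in the
        # round is based on parameters the server has already changed.
        snapshots = [server.pull() for _ in range(num_clients)]
        for w_seen, v_seen in snapshots:
            staleness_log.append(server.push(grad(w_seen), v_seen))
    return server.w, staleness_log


if __name__ == "__main__":
    w, log = simulate()
    print("final w:", w)
    print("staleness in first round:", log[:4])
```

In each round the first push has staleness 0 and every later push is one update staler than the one before it. This is the quantity that update-count approaches measure, and the one the paper proposes replacing with moving averages of gradient statistics.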

