LG, DC, ML · Oct 8, 2018

Toward Understanding the Impact of Staleness in Distributed Machine Learning

arXiv:1810.03264v1 · 93 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of inconsistent results in distributed ML systems, offering researchers and practitioners empirical insights and theoretical grounding, though the advance is incremental.

The paper investigates how stale parameters affect convergence in distributed machine learning, finding diverse impacts across models and algorithms, and provides a new convergence analysis matching the O(1/√T) rate for non-convex optimization.

Many distributed machine learning (ML) systems adopt non-synchronous execution to alleviate the network communication bottleneck, resulting in stale parameters that do not reflect the latest updates. Despite much development in large-scale ML, the effects of staleness on learning remain inconclusive because it is challenging to directly monitor or control staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of stochastic gradient descent in non-convex optimization under staleness, matching the best-known convergence rate of O(1/√T).
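To make "delayed updates" concrete, here is a minimal sketch of gradient descent in which each step uses a gradient evaluated at an older copy of the parameter. It is an illustration only, not the paper's experimental setup: the one-dimensional objective f(w) = 0.5 * w**2, the fixed delay, the deterministic (noise-free) gradient, and the names `stale_gradient_descent`, `staleness`, `lr`, `steps`, and `w0` are all assumptions chosen for brevity.

```python
from collections import deque


def stale_gradient_descent(staleness, lr=0.5, steps=200, w0=1.0):
    """Minimize f(w) = 0.5 * w**2 with gradients taken at a stale iterate.

    Hypothetical toy setup: the gradient applied at step t is evaluated at
    the iterate from `staleness` steps earlier (staleness=0 means no delay).
    Returns the list of iterates, starting with w0.
    """
    # history keeps the last `staleness + 1` iterates; history[0] is the oldest.
    history = deque([w0] * (staleness + 1), maxlen=staleness + 1)
    trajectory = [w0]
    w = w0
    for _ in range(steps):
        stale_w = history[0]   # parameter copy that is `staleness` steps old
        grad = stale_w         # gradient of 0.5 * w**2 is w, taken at the stale copy
        w = w - lr * grad      # the update is applied to the current parameter
        history.append(w)      # appending drops the oldest entry automatically
        trajectory.append(w)
    return trajectory


fresh = stale_gradient_descent(staleness=0)
stale = stale_gradient_descent(staleness=5)
print("final |w| with no delay:", abs(fresh[-1]))
print("largest |w| over the last 20 steps with delay 5:",
      max(abs(w) for w in stale[-20:]))
```

In this toy problem, a step size that converges quickly with fresh gradients produces oscillations of growing amplitude once the gradient is five steps old. That is one simple way staleness can change convergence behavior; the paper's experiments cover far richer models and algorithms.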

