LG · DC · ML · May 22, 2018

Gradient Energy Matching for Distributed Asynchronous Gradient Descent

arXiv:1805.08469v1 · 3 citations · Has Code
Originality: Highly original
AI Analysis

The paper addresses a critical stability problem in large-scale distributed deep learning and offers an incremental improvement over existing distributed optimization methods.

The paper tackles the instability of distributed asynchronous SGD as the number of workers grows. It introduces Gradient Energy Matching (GEM), which keeps the energy of the asynchronous system at or below that of sequential SGD with momentum. Experiments show stable scaling to 100 asynchronous workers and better generalization than the targeted SGD with momentum.

Distributed asynchronous SGD has become widely used for deep learning in large-scale systems, but remains notorious for its instability when increasing the number of workers. In this work, we study the dynamics of distributed asynchronous SGD under the lens of Lagrangian mechanics. Using this description, we introduce the concept of energy to describe the optimization process and derive a sufficient condition ensuring its stability as long as the collective energy induced by the active workers remains below the energy of a target synchronous process. Making use of this criterion, we derive a stable distributed asynchronous optimization procedure, GEM, that estimates and maintains the energy of the asynchronous system below or equal to the energy of sequential SGD with momentum. Experimental results highlight the stability and speedup of GEM compared to existing schemes, even when scaling to one hundred asynchronous workers. Results also indicate better generalization compared to the targeted SGD with momentum.
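
The stability criterion in the abstract can be illustrated with a toy sketch. This is not the paper's GEM update rule: it assumes a kinetic-energy proxy of half the squared norm of a parameter update, and it rescales all worker updates by a single scalar whenever their combined energy exceeds that of a target momentum update. The names `kinetic_energy` and `match_energy` are hypothetical and introduced only for illustration.

```python
def kinetic_energy(update):
    """Assumed energy proxy: half the squared norm of a parameter update."""
    return 0.5 * sum(u * u for u in update)


def match_energy(worker_updates, target_update, eps=1e-12):
    """Rescale worker updates so their combined energy stays at or below the target's.

    worker_updates: list of per-worker update vectors (lists of floats, equal length).
    target_update:  update of the reference process (e.g. one momentum SGD step).
    Returns the rescaled updates and the scalar factor that was applied.
    """
    # Combined update that the parameter server would apply.
    total = [sum(column) for column in zip(*worker_updates)]
    e_async = kinetic_energy(total)
    e_target = kinetic_energy(target_update)

    # Already within the energy budget: leave the updates untouched.
    if e_async <= e_target:
        return [list(u) for u in worker_updates], 1.0

    # Energy scales with the square of the update, so take a square root.
    scale = (e_target / (e_async + eps)) ** 0.5
    return [[scale * x for x in u] for u in worker_updates], scale


# Three workers push the same update; the target step is half as large.
workers = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
target = [1.5, 0.0]
scaled, scale = match_energy(workers, target)
combined = [sum(column) for column in zip(*scaled)]
print(scale, kinetic_energy(combined), kinetic_energy(target))
```

In this example the combined worker update carries four times the target energy, so the scaling factor comes out close to 0.5. A single global scalar is the simplest possible choice; a per-worker or per-parameter scaling would be an equally plausible reading of the abstract.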

Code Implementations: 2 repos