ML · DC · LG · Jun 11, 2018

ATOMO: Communication-efficient Learning via Atomic Sparsification

arXiv:1806.04090v3 · 386 citations
AI Analysis

ATOMO addresses communication bottlenecks in distributed machine learning with a flexible framework that generalizes existing gradient sparsification techniques.

The paper tackles communication overhead in distributed model training by introducing ATOMO, a general framework for atomic sparsification of stochastic gradients. For a given sparsity budget, ATOMO produces an unbiased random sparsification with minimal variance. Sparsifying the singular value decomposition of gradients, rather than their coordinates, can make distributed training significantly faster.

Distributed model training suffers from communication overheads due to frequent gradient updates transmitted between compute nodes. To mitigate these overheads, several studies propose the use of sparsified stochastic gradients. We argue that these are facets of a general sparsification method that can operate on any possible atomic decomposition. Notable examples include element-wise, singular value, and Fourier decompositions. We present ATOMO, a general framework for atomic sparsification of stochastic gradients. Given a gradient, an atomic decomposition, and a sparsity budget, ATOMO gives a random unbiased sparsification of the atoms minimizing variance. We show that recent methods such as QSGD and TernGrad are special cases of ATOMO and that sparsifying the singular value decomposition of neural network gradients, rather than their coordinates, can lead to significantly faster distributed training.
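The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the function names (`atomo_probabilities`, `sparsify`, `svd_sparsify`) are hypothetical. It assumes the gradient is written as a sum of atoms with coefficients c_i, keeps atom i with probability p_i, and rescales each kept coefficient to c_i / p_i so that the result is unbiased. The probabilities are chosen to minimize the variance term sum(c_i^2 / p_i) subject to sum(p_i) <= budget and 0 < p_i <= 1, which gives p_i proportional to |c_i|, with any atom whose probability would exceed 1 kept deterministically.

```python
import numpy as np


def atomo_probabilities(coeffs, budget):
    """Keep-probabilities p_i minimizing sum(c_i^2 / p_i)
    subject to sum(p_i) <= budget and 0 < p_i <= 1 (illustrative sketch)."""
    mags = np.abs(np.asarray(coeffs, dtype=float))
    probs = np.zeros_like(mags)
    active = mags > 0              # zero-coefficient atoms are never sent
    remaining = float(budget)
    while active.any() and remaining > 0:
        # Unconstrained optimum: probabilities proportional to |c_i|.
        scaled = remaining * mags[active] / mags[active].sum()
        if (scaled <= 1.0).all():
            probs[active] = scaled
            break
        # Atoms that would need p_i > 1 are kept with certainty; the rest
        # of the budget is redistributed over the remaining atoms.
        saturated = np.flatnonzero(active)[scaled > 1.0]
        probs[saturated] = 1.0
        active[saturated] = False
        remaining -= len(saturated)
    return probs


def sparsify(coeffs, probs, rng):
    """Unbiased random sparsification: keep c_i with probability p_i
    and rescale it to c_i / p_i; dropped coefficients become 0."""
    coeffs = np.asarray(coeffs, dtype=float)
    keep = rng.random(coeffs.shape) < probs
    out = np.zeros_like(coeffs)
    out[keep] = coeffs[keep] / probs[keep]
    return out


def svd_sparsify(grad, budget, rng):
    """Sparsify a gradient matrix in its singular value decomposition:
    the atoms are rank-one matrices u_i v_i^T with coefficients sigma_i."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s_hat = sparsify(s, atomo_probabilities(s, budget), rng)
    kept = np.flatnonzero(s_hat)   # only these atoms would be transmitted
    return (u[:, kept] * s_hat[kept]) @ vt[kept, :]
```

Calling `svd_sparsify(grad, budget, np.random.default_rng(0))` returns an unbiased estimate of `grad` built from, in expectation, at most `budget` singular triplets. Passing the flattened gradient entries to `atomo_probabilities` and `sparsify` directly gives the element-wise (coordinate) variant mentioned in the abstract.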

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
