LG DC ML Jun 12, 2020

O(1) Communication for Distributed SGD through Two-Level Gradient Averaging

arXiv:2006.07405v2
Originality: Incremental advance
AI Analysis

This paper addresses the communication bottleneck in distributed deep learning, which enables faster training of large models, though it is an incremental improvement over prior gradient compression techniques.

The paper tackles the high communication cost of distributed Stochastic Gradient Descent (SGD) for large neural networks by introducing a two-level gradient averaging strategy (A2SGD). A2SGD reduces communication complexity to O(1) per worker and improves the overall training time of LSTM-PTB by 3.2x compared to Top-K and by 23.2x compared to QSGD.

Large neural network models present a hefty communication challenge to distributed Stochastic Gradient Descent (SGD), with a communication complexity of O(n) per worker for a model of n parameters. Many sparsification and quantization techniques have been proposed to compress the gradients, some reducing the communication complexity to O(k), where k << n. In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) to consolidate all gradients down to merely two local averages per worker before the computation of two global averages for an updated model. A2SGD also retains local errors to maintain the variance for fast convergence. Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. Our evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces the communication traffic per worker, and improves the overall training time of LSTM-PTB by 3.2x and 23.2x compared to Top-K and QSGD, respectively. To the best of our knowledge, A2SGD is the first to achieve O(1) communication complexity per worker for distributed SGD.
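The averaging scheme described in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the two local averages are formed, so the sketch assumes one average over the non-negative gradient entries and one over the negative entries. The function names `two_local_averages` and `a2sgd_round_sketch` are hypothetical, and the all-reduce across workers is simulated with a plain Python mean.

```python
def mean(values):
    """Arithmetic mean; returns 0.0 for an empty group (an assumption)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def two_local_averages(grad):
    """Collapse one worker's gradient into two scalars plus a local error.

    Assumption: the two averages are taken over the non-negative and the
    negative entries. The abstract does not specify the split.
    """
    mu_pos = mean(g for g in grad if g >= 0)
    mu_neg = mean(g for g in grad if g < 0)
    # Local error: what is lost when each entry is replaced by its group mean.
    # It stays on the worker and is never communicated.
    residual = [g - (mu_pos if g >= 0 else mu_neg) for g in grad]
    return mu_pos, mu_neg, residual


def a2sgd_round_sketch(worker_grads):
    """Simulate one round: each worker sends 2 scalars, whatever n is."""
    summaries = [two_local_averages(g) for g in worker_grads]
    # Stand-in for an all-reduce over two scalars per worker.
    global_pos = mean(s[0] for s in summaries)
    global_neg = mean(s[1] for s in summaries)
    updates = []
    for grad, (_, _, residual) in zip(worker_grads, summaries):
        # Each worker rebuilds a dense gradient from the two global averages
        # and adds back its retained local error.
        updates.append([
            (global_pos if g >= 0 else global_neg) + r
            for g, r in zip(grad, residual)
        ])
    return global_pos, global_neg, updates


if __name__ == "__main__":
    grads = [[1.0, 3.0, -2.0], [2.0, -4.0, -6.0]]
    print(a2sgd_round_sketch(grads))
```

In this sketch the communicated payload is two floats per worker regardless of the gradient length, which is the sense in which the per-worker communication is O(1). With a single worker the global and local averages coincide, so the rebuilt gradient equals the original one.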

