LG · OC · ML · Apr 28, 2021

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

arXiv:2104.13818v2 · 35 citations
AI Analysis

This work addresses communication bottlenecks in data-parallel SGD for large-scale model training, offering an incremental improvement over prior quantization techniques.

The authors tackled the problem of communication overhead in distributed training by proposing a new gradient quantization scheme, NUQSGD, which achieved stronger theoretical guarantees and matched or exceeded the empirical performance of existing methods like QSGDinf.

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; for practical purposes, however, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it both has stronger theoretical guarantees than QSGD and matches or exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.
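The abstract does not spell out the quantizer itself, so the following is only a minimal sketch of the general idea: normalize each gradient coordinate by the vector's L2 norm, then round it stochastically to one of a small set of levels so that the result is unbiased. It assumes exponentially spaced levels (0, 2^-s, ..., 1/2, 1) for the nonuniform case and evenly spaced levels for a QSGD-style baseline. The names `make_levels` and `quantize` are hypothetical, and the encoding step that would follow quantization is omitted.

```python
import bisect
import math
import random


def make_levels(s, nonuniform=True):
    """Return s + 2 sorted quantization levels in [0, 1] (illustrative helper)."""
    if nonuniform:
        # Assumed nonuniform grid: 0, 2^-s, ..., 1/4, 1/2, 1
        return [0.0] + [2.0 ** (-j) for j in range(s, -1, -1)]
    # Evenly spaced grid for comparison: 0, 1/(s+1), ..., 1
    return [i / (s + 1) for i in range(s + 2)]


def quantize(grad, levels, rng=random):
    """Stochastically round each |g_i| / ||g||_2 to a neighbouring level.

    The rounding probabilities are chosen so that the expected output
    equals the input (unbiased quantization).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    out = []
    for g in grad:
        # Normalized magnitude, clamped to guard against float round-off
        r = min(abs(g) / norm, 1.0)
        # Index of the first level that is >= r
        hi_idx = bisect.bisect_left(levels, r)
        if levels[hi_idx] == r:
            q = r  # already on a level, nothing to round
        else:
            lo, hi = levels[hi_idx - 1], levels[hi_idx]
            p_up = (r - lo) / (hi - lo)  # probability of rounding up
            q = hi if rng.random() < p_up else lo
        # Restore the sign and the scale
        out.append(math.copysign(q, g) * norm)
    return out


if __name__ == "__main__":
    levels = make_levels(3)
    print(levels)
    print(quantize([3.0, -4.0], levels, random.Random(0)))
```

In a scheme like this, each worker would only need to send the norm, the signs, and the level indices. Placing more levels near zero is a design choice that suits gradients whose normalized coordinates are mostly small.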
