LG OC MLJan 26, 2019

Distributed Learning with Compressed Gradient Differences

Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, Peter Richtárik

arXiv:1901.09269v330.6211 citations

Originality Highly original

AI Analysis

This addresses the communication efficiency problem in distributed training for large-scale ML models, offering a novel solution to a known bottleneck.

The paper tackles the communication bottleneck in distributed machine learning by proposing DIANA, a method that compresses gradient differences to achieve convergence to the true optimum, with theoretical analysis showing superior rates in strongly convex and nonconvex settings.

Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new distributed learning method -- DIANA -- which resolves this issue via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are superior to existing rates. We also provide theory to support non-smooth regularizers study the difference between quantization schemes. Our analysis of block-quantization and differences between $\ell_2$ and $\ell_{\infty}$ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.

View on arXiv PDF

Similar