LG | DC | Jun 15, 2023

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

arXiv:2306.08881v1 | 15 citations | h-index: 70 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the communication bottleneck that deep learning practitioners face in distributed training. It offers a compression method that integrates with existing system optimizations, though the advance is incremental relative to prior compression techniques.

The paper observes that gradient compression methods often fail to outperform optimized synchronous SGD (S-SGD) in distributed deep learning because they are incompatible with system optimizations. It proposes ACP-SGD, which achieves average speedups of 4.06x over S-SGD and 1.43x over Power-SGD while maintaining accuracy.

To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank with Power-SGD) on a 32-GPU cluster. The results show that they cannot always outperform well-optimized S-SGD, and can even perform worse, because they are incompatible with three key system optimization techniques in S-SGD (all-reduce, pipelining, and tensor fusion). To this end, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but is also compatible with the same three system optimizations as S-SGD. Compared with Power-SGD, the optimized ACP-SGD largely reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06x over S-SGD and 1.43x over Power-SGD, and it consistently outperforms other baselines across different setups (from 8 GPUs to 64 GPUs and from 1Gb/s Ethernet to 100Gb/s InfiniBand).
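To make the phrase "alternately compresses and communicates low-rank matrices" concrete, here is a minimal single-process NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the even/odd schedule, and the use of a plain mean to stand in for an all-reduce are all hypothetical choices made here, and error feedback (which Power-SGD uses) is left out for brevity. The idea shown is that each step refreshes only one of the two low-rank factors, so only one small matrix has to be averaged across workers per step.

```python
import numpy as np

def orthogonalize(mat):
    """Return a matrix with orthonormal columns spanning the same space (via QR)."""
    q, _ = np.linalg.qr(mat)
    return q

def alternating_low_rank_step(grads, p, q, step):
    """One simulated step over per-worker gradient matrices `grads` (each n x m).

    Even steps refresh P (n x r) against an orthonormalized Q; odd steps
    refresh Q (m x r) against an orthonormalized P. Only the refreshed factor
    is averaged; the mean here is a stand-in for a single all-reduce.
    """
    if step % 2 == 0:
        q = orthogonalize(q)
        p = np.mean([g @ q for g in grads], axis=0)    # simulated all-reduce of P
    else:
        p = orthogonalize(p)
        q = np.mean([g.T @ p for g in grads], axis=0)  # simulated all-reduce of Q
    return p, q, p @ q.T  # low-rank approximation of the averaged gradient

# Hypothetical demo: 4 workers whose averaged gradient has rank exactly r.
rng = np.random.default_rng(0)
n, m, r, workers = 64, 32, 4, 4
basis = rng.standard_normal((m, r))
grads = [rng.standard_normal((n, r)) @ basis.T for _ in range(workers)]

p = np.zeros((n, r))              # overwritten on the first (even) step
q = rng.standard_normal((m, r))   # random initial right factor
for step in range(2):
    p, q, approx = alternating_low_rank_step(grads, p, q, step)

mean_grad = np.mean(grads, axis=0)
rel_err = np.linalg.norm(approx - mean_grad) / np.linalg.norm(mean_grad)
print(f"relative error after 2 steps: {rel_err:.2e}")
print(f"values sent per step: {n * r} or {m * r}, versus {n * m} uncompressed")
```

Because each step exchanges a single dense factor, the exchange is an ordinary averaging operation, the kind of collective that all-reduce handles directly. The demo reuses the same gradients on both steps so the approximation can be checked against a known low-rank target; in real training the gradients change every iteration.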

Code Implementations: 1 repo
