Detached Error Feedback for Distributed SGD with Random Sparsification
This work addresses communication efficiency in large-scale distributed deep learning, offering incremental improvements over existing methods.
The paper tackles the communication bottleneck in distributed SGD by proposing a detached error feedback algorithm that improves convergence and generalization bounds for non-convex problems, showing significant empirical improvements in deep learning experiments.
The communication bottleneck is a critical problem in large-scale distributed deep learning. In this work, we study distributed SGD with random block-wise sparsification as the gradient compressor, which is ring-allreduce compatible and highly computation-efficient but leads to inferior performance. To tackle this issue, we improve communication-efficient distributed SGD from a novel perspective: the trade-off between the variance and the second moment of the gradient. With this motivation, we propose a new detached error feedback (DEF) algorithm, which achieves a better convergence bound than error feedback for non-convex problems. We also propose DEF-A to accelerate the generalization of DEF in the early stages of training; DEF-A achieves better generalization bounds than DEF. Furthermore, we establish, for the first time, the connection between communication-efficient distributed SGD and SGD with iterate averaging (SGD-IA). Extensive deep learning experiments show significant empirical improvements of the proposed methods under various settings.
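For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of distributed SGD with random block-wise sparsification and classic error feedback. It is not the DEF or DEF-A algorithm, whose update rules are not given in this text. The function names (`select_block`, `ef_sgd_step`), the simulation of workers as lists inside one process, and the assumption that all workers draw the same block from a shared random generator (so that the kept block can be averaged densely, as a ring-allreduce would) are illustrative choices, not details taken from the paper.

```python
import random


def select_block(dim, num_blocks, rng):
    """Pick one contiguous block of coordinates uniformly at random.

    Assumption: dim is divisible by num_blocks, and every worker calls this
    with an identically seeded generator so they all pick the same block.
    """
    if dim % num_blocks != 0:
        raise ValueError("dim must be divisible by num_blocks in this sketch")
    block_size = dim // num_blocks
    b = rng.randrange(num_blocks)
    return b * block_size, (b + 1) * block_size


def ef_sgd_step(x, grads, errors, lr, num_blocks, rng):
    """One step of classic error-feedback SGD with a random block compressor.

    x      : current parameters (list of floats)
    grads  : one gradient per worker (list of lists)
    errors : one residual per worker; updated in place and returned
    """
    start, end = select_block(len(x), num_blocks, rng)
    num_workers = len(grads)
    block_avg = [0.0] * (end - start)

    for w in range(num_workers):
        # Add back what earlier compressions dropped on this worker.
        p = [lr * g + e for g, e in zip(grads[w], errors[w])]
        for j in range(start, end):
            # Only the selected block is communicated and averaged.
            block_avg[j - start] += p[j] / num_workers
            p[j] = 0.0
        # Everything outside the block is kept locally as the new residual.
        errors[w] = p

    new_x = list(x)
    for j in range(start, end):
        new_x[j] -= block_avg[j - start]
    return new_x, errors
```

In this sketch only `dim / num_blocks` coordinates are exchanged per step, and the dropped coordinates are carried in `errors` until their block is eventually selected. As a result, the parameters minus the mean residual follow the uncompressed SGD trajectory for a fixed sequence of gradients.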