CSER: Communication-efficient SGD with Error Reset
This work addresses communication inefficiency in distributed deep learning and reports significant training speedups on benchmarks such as CIFAR-100 and ImageNet. The contribution is incremental, since it builds on existing compression techniques.
The paper tackles the communication bottleneck in distributed SGD by proposing CSER, a variant that uses error reset and partial synchronization to enable aggressive compression, achieving nearly 10x speedup on CIFAR-100 and 4.5x on ImageNet.
The scalability of distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. CSER rests on two ideas. The first is a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models whose local residual errors are periodically reset. The second is partial synchronization of both the gradients and the models, which leverages the advantages of each. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that, when combined with highly aggressive compressors, the CSER algorithms accelerate distributed training by nearly 10x for CIFAR-100 and by 4.5x for ImageNet.
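To make the error-reset idea concrete, the following is a minimal single-process sketch, not the paper's exact algorithm. It assumes a top-k sparsifier as the compressor, a toy quadratic objective per worker, and a fixed reset period; the names top_k, mean, train, and reset_period are hypothetical and chosen only for illustration. Each simulated worker sends the compressed part of its update, keeps the remainder as a local residual error (so its local model diverges from the synchronized one), and every reset_period steps the residuals are averaged into the synchronized model and cleared.

```python
def top_k(vec, k):
    """Assumed compressor: keep the k largest-magnitude entries.

    Returns (compressed, residual) with compressed + residual == vec.
    """
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    compressed = [v if i in keep else 0.0 for i, v in enumerate(vec)]
    residual = [0.0 if i in keep else v for i, v in enumerate(vec)]
    return compressed, residual


def mean(vectors):
    """Element-wise average; stands in for an all-reduce across workers."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train(targets, steps=200, lr=0.1, k=1, reset_period=8):
    """Toy setup (assumption): worker i minimizes 0.5 * ||x - targets[i]||^2."""
    n, dim = len(targets), len(targets[0])
    x = [0.0] * dim                            # synchronized model
    errors = [[0.0] * dim for _ in range(n)]   # local residual errors
    for t in range(1, steps + 1):
        sent = []
        for i in range(n):
            # Bifurcated local model: synchronized part minus local residual.
            local = [xj - ej for xj, ej in zip(x, errors[i])]
            update = [lr * (lj - tj) for lj, tj in zip(local, targets[i])]
            compressed, residual = top_k(update, k)
            sent.append(compressed)            # only this part is communicated
            errors[i] = [ej + rj for ej, rj in zip(errors[i], residual)]
        # Apply the averaged compressed updates to the synchronized model.
        x = [xj - aj for xj, aj in zip(x, mean(sent))]
        if t % reset_period == 0:
            # Error reset: fold the averaged residuals into x, then clear them.
            x = [xj - ej for xj, ej in zip(x, mean(errors))]
            errors = [[0.0] * dim for _ in range(n)]
    return x, errors
```

In this toy setting the average of the local models follows plain gradient descent on the averaged objective, so the synchronized model after a reset approaches the mean of the worker targets. The reset step is the only point where the full (uncompressed) residuals are exchanged, which is why a longer reset_period trades communication for more drift between local models.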