On Communication Compression for Distributed Optimization on Heterogeneous Data
The paper addresses the communication bottleneck in distributed machine learning training on non-iid data; its contribution is incremental, as it builds on existing compression methods.
The paper analyzes the impact of heterogeneous data on distributed optimization with gradient compression. It finds that D-EF-SGD is much less affected than D-QSGD, but that both slow down when data skewness is high. It also studies two alternatives that are not (or much less) affected by heterogeneity: a recently proposed method that is effective on strongly convex problems, and a more general approach that applies only to linear compressors.
Lossy gradient compression, with either unbiased or biased compressors, has become a key tool to avoid the communication bottleneck in centrally coordinated distributed training of machine learning models. We analyze the performance of two standard and general types of methods: (i) distributed quantized SGD (D-QSGD) with arbitrary unbiased quantizers and (ii) distributed SGD with error-feedback and biased compressors (D-EF-SGD) in the heterogeneous (non-iid) data setting. Our results indicate that D-EF-SGD is much less affected than D-QSGD by non-iid data, but both methods can suffer a slowdown if data skewness is high. We further study two alternatives that are not (or much less) affected by heterogeneous data distributions: first, a recently proposed method that is effective on strongly convex problems, and second, a more general approach that is applicable to linear compressors only but effective in all considered scenarios.
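To make the two method families in the abstract concrete, the following is a minimal sketch of one communication round of each, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: rand-k and top-k sparsification are chosen here as familiar examples of an unbiased and a biased compressor, the function names (`rand_k`, `top_k`, `d_qsgd_step`, `d_ef_sgd_step`) are hypothetical, and the error-feedback update follows one common formulation in which the step size is folded into the vector that gets compressed.

```python
import random

def rand_k(x, k, rng):
    """Unbiased sparsifier (assumed example): keep k random coordinates,
    rescaled by d/k so that the expectation of the output equals x."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    return [x[j] * d / k if j in keep else 0.0 for j in range(d)]

def top_k(x, k):
    """Biased sparsifier (assumed example): keep the k largest-magnitude
    coordinates and zero out the rest."""
    order = sorted(range(len(x)), key=lambda j: abs(x[j]), reverse=True)
    keep = set(order[:k])
    return [x[j] if j in keep else 0.0 for j in range(len(x))]

def mean(vectors):
    """Coordinate-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def d_qsgd_step(x, grad_fns, lr, k, rng):
    """One D-QSGD-style round: every worker sends an unbiased compressed
    gradient; the server averages the messages and takes a step."""
    avg = mean([rand_k(g(x), k, rng) for g in grad_fns])
    return [xj - lr * aj for xj, aj in zip(x, avg)]

def d_ef_sgd_step(x, errors, grad_fns, lr, k):
    """One D-EF-SGD-style round: every worker adds its stored residual to the
    scaled gradient, sends the compressed result, and keeps what was dropped."""
    msgs, new_errors = [], []
    for g, e in zip(grad_fns, errors):
        p = [ej + lr * gj for ej, gj in zip(e, g(x))]
        c = top_k(p, k)
        msgs.append(c)
        new_errors.append([pj - cj for pj, cj in zip(p, c)])
    avg = mean(msgs)
    return [xj - aj for xj, aj in zip(x, avg)], new_errors

if __name__ == "__main__":
    # Heterogeneous toy problem: worker i minimizes 0.5 * ||x - b_i||^2 with a
    # different target b_i, so local gradients disagree even at the optimum.
    targets = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
    grad_fns = [lambda x, b=b: [xj - bj for xj, bj in zip(x, b)] for b in targets]

    rng = random.Random(0)
    x_q = [0.0, 0.0, 0.0]
    x_ef, errors = [0.0, 0.0, 0.0], [[0.0] * 3 for _ in targets]
    for _ in range(200):
        x_q = d_qsgd_step(x_q, grad_fns, lr=0.1, k=1, rng=rng)
        x_ef, errors = d_ef_sgd_step(x_ef, errors, grad_fns, lr=0.1, k=1)
    print("D-QSGD iterate:  ", x_q)
    print("D-EF-SGD iterate:", x_ef)
    print("average target:  ", mean(targets))
```

In this toy setup the per-worker targets differ, so each local gradient stays nonzero at the shared optimum (the average target). That is the regime the abstract calls heterogeneous: the compression error on each worker's message does not vanish as the iterates approach the solution. The error-feedback variant keeps the dropped coordinates in `errors` and resends them later, so in each round the transmitted part plus the stored residual equals the full scaled update.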