Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression
This work addresses gradient-aggregation bottlenecks in distributed machine learning training. Its contribution is incremental: it improves existing compression approaches rather than introducing a new paradigm.
Many gradient compression schemes fail to accelerate distributed machine learning training while preserving accuracy. The paper identifies common issues in prior systems and evaluation methods, and shows that minor design changes can lead to notably better performance.
Gradient aggregation has long been identified as a major bottleneck in today's large-scale distributed machine learning training systems. One promising solution to mitigate such bottlenecks is gradient compression, directly reducing communicated gradient data volume. However, in practice, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. In this work, we identify common issues in previous gradient compression systems and evaluation methodologies. These include excessive computational overheads; incompatibility with all-reduce; and insufficient evaluation methods, such as not using an end-to-end metric or using a 32-bit baseline instead of the stronger 16-bit baseline. We revisit common compression approaches (sparsification, quantization, and low-rank decomposition) and demonstrate how considering the above issues can lead to minor but strategic design changes, resulting in notably better performance. Our goal is to raise awareness of the need for design and evaluation standards that naturally translate to the end-to-end utility of gradient compression.
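Two of the issues named above can be made concrete with a small sketch. This is an illustration only, not the paper's implementation: the helper names (`top_k_sparsify`, `aggregate_sparse`, `compression_ratio`), the Top-k sparsifier, and the 8-bit quantizer are assumptions chosen for the example, and the ratio ignores index and metadata overhead. The first part shows why a sparsifier that lets each worker pick its own coordinates does not map onto a fixed-size all-reduce: the aggregated result covers the union of the selected indices, which can grow with the number of workers. The second part shows how the choice of baseline changes the reported compression ratio.

```python
import random


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad as {index: value}."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def aggregate_sparse(sparse_grads):
    """Sum sparse gradients; the support of the sum is the union of supports."""
    total = {}
    for sparse in sparse_grads:
        for index, value in sparse.items():
            total[index] = total.get(index, 0.0) + value
    return total


def compression_ratio(compressed_bits, baseline_bits):
    """Bits per value in the baseline divided by bits per value after compression."""
    return baseline_bits / compressed_bits


# Hypothetical setup: 4 workers, 1000-dimensional gradients, keep 10 entries each.
rng = random.Random(0)
num_workers, dim, k = 4, 1000, 10
grads = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_workers)]

sparse_grads = [top_k_sparsify(g, k) for g in grads]
aggregated = aggregate_sparse(sparse_grads)

# Each worker sends k entries, but the sum may hold up to num_workers * k entries,
# so the message size is not preserved across the reduction.
print("entries per worker:", k)
print("entries after aggregation:", len(aggregated))

# Assumed 8-bit quantizer: the ratio halves when measured against a 16-bit baseline.
print("vs 32-bit baseline:", compression_ratio(8, 32))
print("vs 16-bit baseline:", compression_ratio(8, 16))
```

In this sketch the same 8-bit scheme reports a 4x reduction against a 32-bit baseline but only 2x against the 16-bit one, which is why the choice of baseline matters when judging end-to-end utility.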