MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training
This work addresses the scalability and speed of distributed DNN training, which matters for large-scale machine learning applications, though the contribution appears incremental because it builds on existing gradient sparsification techniques.
The paper tackles two costs of gradient sparsification in distributed deep neural network training: the high computational cost of gradient selection and the increase in communication traffic. It proposes MiCRO, a method that partitions the gradient vector across workers and estimates the selection threshold so that sparsification runs at near-zero cost. In the reported experiments, MiCRO outperforms state-of-the-art sparsifiers and achieves an outstanding convergence rate.
Gradient sparsification is a communication optimisation technique for scaling and accelerating distributed deep neural network (DNN) training. It reduces the growing communication traffic required for gradient aggregation. However, existing sparsifiers scale poorly because of the high computational cost of gradient selection, an increase in communication traffic, or both. In particular, the increase in communication traffic is caused by gradient build-up and by an inappropriate threshold for gradient selection. To address these challenges, we propose a novel gradient sparsification method called MiCRO. In MiCRO, the gradient vector is partitioned, and each partition is assigned to a corresponding worker. Each worker then selects gradients only from its own partition, so the aggregated gradients are free from gradient build-up. Moreover, MiCRO estimates an accurate threshold that keeps the communication traffic at the level the user requires by minimising the compression ratio error. MiCRO thus enables near-zero cost gradient sparsification by solving the existing problems that hinder the scalability and acceleration of distributed DNN training. In our extensive experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
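The partitioning and thresholding ideas described above can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the function names (`partition_bounds`, `select_in_partition`, `adjust_threshold`, `sparsify_step`), the contiguous partitioning scheme, the summation over all workers at the selected indices, and the multiplicative threshold update with a fixed `step` are all assumptions made for illustration. The abstract states only that the threshold is estimated by minimising the compression ratio error, so the update rule below is a simple stand-in for that idea.

```python
def partition_bounds(length, n_workers):
    """Split indices [0, length) into n_workers contiguous, disjoint ranges."""
    base, extra = divmod(length, n_workers)
    bounds, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def select_in_partition(grad, bounds, threshold):
    """Indices inside one partition whose gradient magnitude meets the threshold."""
    lo, hi = bounds
    return [i for i in range(lo, hi) if abs(grad[i]) >= threshold]


def adjust_threshold(threshold, n_selected, n_target, step=0.1):
    """Hypothetical update rule (an assumption, not the paper's estimator):
    raise the threshold when too many gradients were selected, lower it
    when too few were, so the achieved density drifts towards the target."""
    if n_selected > n_target:
        return threshold * (1.0 + step)
    if n_selected < n_target:
        return threshold * (1.0 - step)
    return threshold


def sparsify_step(worker_grads, threshold, density):
    """One sparsification step over a list of per-worker gradient vectors.

    Each worker picks indices only from its own partition, so the selected
    index sets are disjoint and their union cannot grow through overlap
    between workers (no gradient build-up)."""
    n_workers = len(worker_grads)
    length = len(worker_grads[0])
    bounds = partition_bounds(length, n_workers)

    selected = []
    for w in range(n_workers):
        selected.extend(select_in_partition(worker_grads[w], bounds[w], threshold))

    # Assumption: values at the selected indices are summed across all workers.
    aggregated = {i: sum(g[i] for g in worker_grads) for i in selected}

    n_target = max(1, round(density * length))
    new_threshold = adjust_threshold(threshold, len(selected), n_target)
    return aggregated, new_threshold


if __name__ == "__main__":
    grads = [
        [0.5, -0.1, 0.9, 0.2, 0.3, 0.05],   # worker 0 owns indices 0..2
        [0.1, 0.2, 0.3, -0.8, 0.01, 0.6],   # worker 1 owns indices 3..5
    ]
    aggregated, new_threshold = sparsify_step(grads, threshold=0.4, density=0.5)
    print(sorted(aggregated), new_threshold)
```

In this toy run, worker 0 ignores the large entry of worker 1 at index 3 and vice versa, because each worker inspects only its own partition. Selecting four of six gradients overshoots the target density of 0.5, so the sketch raises the threshold for the next step.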