LG, DC, SY | Mar 12, 2021

EventGraD: Event-Triggered Communication in Parallel Machine Learning

arXiv:2103.07454v2 | 9 citations
AI Analysis

This paper addresses communication bottlenecks in parallel ML systems, though the contribution appears incremental, since it modifies existing SGD approaches.

The paper tackles communication overhead in parallel machine learning by proposing EventGraD, an event-triggered communication algorithm for stochastic gradient descent. When training a residual network on CIFAR-10, it reduces communication load by up to 60% while maintaining accuracy.

Communication in parallel systems imposes significant overhead which often turns out to be a bottleneck in parallel machine learning. To relieve some of this overhead, in this paper, we present EventGraD - an algorithm with event-triggered communication for stochastic gradient descent in parallel machine learning. The main idea of this algorithm is to modify the requirement of communication at every iteration in standard implementations of stochastic gradient descent in parallel machine learning to communicating only when necessary at certain iterations. We provide theoretical analysis of convergence of our proposed algorithm. We also implement the proposed algorithm for data-parallel training of a popular residual neural network used for training the CIFAR-10 dataset and show that EventGraD can reduce the communication load by up to 60% while retaining the same level of accuracy. In addition, EventGraD can be combined with other approaches such as Top-K sparsification to decrease communication further while maintaining accuracy.
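The event-triggered idea described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: the abstract does not state the trigger condition, so the rule used here (a worker sends its parameters only when they have drifted from the last sent copy by more than a fixed L2-norm threshold) is an assumption, as are the all-to-all averaging, the toy quadratic objectives, and the names `should_send` and `simulate`.

```python
import numpy as np


def should_send(current, last_sent, threshold):
    """Assumed trigger: fire when the parameters have drifted from the
    last transmitted copy by more than `threshold` in L2 norm."""
    return np.linalg.norm(current - last_sent) > threshold


def simulate(num_workers=4, dim=10, steps=200, lr=0.1, threshold=0.05, seed=0):
    """Toy data-parallel SGD where each worker minimizes 0.5 * ||x - t_i||^2.

    Returns the final parameters, the number of messages actually sent,
    and the number that would be sent if every worker communicated at
    every iteration.
    """
    rng = np.random.default_rng(seed)
    targets = rng.normal(size=(num_workers, dim))
    params = np.zeros((num_workers, dim))
    # last_sent[i] is the stale copy of worker i that the others hold.
    last_sent = params.copy()
    messages = 0

    for _ in range(steps):
        new_params = np.empty_like(params)
        for i in range(num_workers):
            # Worker i sees its own fresh parameters and stale copies of the rest.
            view = last_sent.copy()
            view[i] = params[i]
            mixed = view.mean(axis=0)
            # Noisy gradient of the local quadratic objective.
            grad = (params[i] - targets[i]) + rng.normal(scale=0.01, size=dim)
            new_params[i] = mixed - lr * grad
        params = new_params

        # Communicate only when the trigger fires, instead of every iteration.
        for i in range(num_workers):
            if should_send(params[i], last_sent[i], threshold):
                last_sent[i] = params[i].copy()
                messages += 1

    return params, messages, num_workers * steps


if __name__ == "__main__":
    _, sent, baseline = simulate()
    print(f"messages sent: {sent} of {baseline}")
```

Raising `threshold` trades fresher neighbor information for fewer messages; a threshold of zero behaves like communicating at nearly every iteration. The savings this toy produces say nothing about the 60% figure reported for CIFAR-10.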

Code Implementations: 2 repos