IT, DC, LG, SP · Nov 3, 2020

Gradient Coding with Dynamic Clustering for Straggler Mitigation

arXiv:2011.01922v1 · 7 citations
AI Analysis

This work addresses straggler-induced performance bottlenecks in distributed machine learning, such as large-scale training. The contribution is incremental, since it builds on existing gradient coding techniques.

The paper tackles the problem of slow workers (stragglers) in distributed gradient descent by proposing a gradient coding scheme with dynamic clustering (GC-DC), which reduces the average iteration completion time by up to 30% compared to the original method without increasing communication load.

In distributed synchronous gradient descent (GD), the main performance bottleneck for the per-iteration completion time is the slowest, straggling workers. To speed up GD iterations in the presence of stragglers, coded distributed computation techniques are implemented by assigning redundant computations to workers. In this paper, we propose a novel gradient coding (GC) scheme that utilizes dynamic clustering, denoted by GC-DC, to speed up the gradient calculation. Under time-correlated straggling behavior, GC-DC aims at regulating the number of straggling workers in each cluster based on the straggler behavior in the previous iteration. We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
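
To make the idea of regulating the number of stragglers per cluster concrete, here is a minimal Python sketch. It is not the paper's algorithm: the function names `recluster` and `stragglers_per_cluster` are hypothetical, clusters are assumed to be of equal size, and the sketch ignores any restriction that data placement might impose on which worker can join which cluster. It only illustrates the general principle of using the previous iteration's straggler observations to spread likely stragglers evenly.

```python
from typing import List


def recluster(straggled_last_iter: List[bool], num_clusters: int) -> List[List[int]]:
    """Assign workers to equal-size clusters, spreading last iteration's
    stragglers as evenly as possible (illustrative heuristic, an assumption)."""
    n = len(straggled_last_iter)
    if num_clusters <= 0 or n % num_clusters != 0:
        raise ValueError("number of workers must be a multiple of num_clusters")
    size = n // num_clusters

    stragglers = [w for w, slow in enumerate(straggled_last_iter) if slow]
    fast = [w for w, slow in enumerate(straggled_last_iter) if not slow]

    clusters: List[List[int]] = [[] for _ in range(num_clusters)]
    # Deal the previously slow workers round-robin so no cluster collects them all.
    for i, w in enumerate(stragglers):
        clusters[i % num_clusters].append(w)
    # Fill the remaining slots of each cluster with previously fast workers.
    remaining = iter(fast)
    for cluster in clusters:
        while len(cluster) < size:
            cluster.append(next(remaining))
    return clusters


def stragglers_per_cluster(clusters: List[List[int]], straggled: List[bool]) -> List[int]:
    """Count how many workers in each cluster were stragglers."""
    return [sum(straggled[w] for w in cluster) for cluster in clusters]


# Six workers; workers 0 and 1 straggled in the previous iteration.
flags = [True, True, False, False, False, False]
static = [[0, 1, 2], [3, 4, 5]]      # fixed clustering: both stragglers together
dynamic = recluster(flags, 2)        # [[0, 2, 3], [1, 4, 5]]
print(stragglers_per_cluster(static, flags))   # [2, 0]
print(stragglers_per_cluster(dynamic, flags))  # [1, 1]
```

If straggling is correlated over time, as the abstract assumes, a worker that was slow in the previous iteration is likely to be slow again, so a balanced assignment like the one above keeps any single cluster from being held up by several slow workers at once.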

