Gradient Coding
This addresses the issue of stragglers slowing down distributed learning systems, which is an incremental improvement over existing methods.
The paper tackles the problem of stragglers in distributed learning by proposing a coding theoretic framework that replicates data blocks and codes across gradients to provide tolerance to failures and stragglers for Synchronous Gradient Descent, showing comparisons in running time and generalization error against baseline approaches on Amazon EC2.
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for Synchronous Gradient Descent. We implement our schemes in python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and generalization error.