LG CR DC ML

May 25, 2018

Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

arXiv:1805.10032v3

26.978 citations

Has Code

Originality: Highly original

AI Analysis

This work addresses fault tolerance in distributed machine learning, offering a more robust solution than prior methods, which required a majority of non-faulty nodes.

The paper makes distributed stochastic gradient descent tolerant to an arbitrary number of faulty workers, requiring only one non-faulty worker, and proves convergence for non-convex problems.

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
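The suspicion-and-ranking idea can be illustrated with a minimal sketch. The abstract does not spell out the scoring function, so the rule used here (estimated loss decrease from applying a candidate gradient, minus a penalty on its magnitude) is an assumption of this example, as are the names `descendant_score` and `zeno_aggregate` and the parameters `lr`, `rho`, and `b`. The server scores every candidate gradient, ranks them, and averages only the best-ranked ones, so a single non-faulty worker is enough to produce a useful update.

```python
def descendant_score(loss, x, g, lr, rho):
    """Assumed score: estimated loss decrease from stepping along -g,
    minus a penalty proportional to the squared norm of g."""
    stepped = [xi - lr * gi for xi, gi in zip(x, g)]
    return loss(x) - loss(stepped) - rho * sum(gi * gi for gi in g)


def zeno_aggregate(loss, x, grads, lr, rho, b):
    """Rank candidate gradients by score and average the len(grads) - b best.

    b is the number of candidates to discard as suspicious (hypothetical
    parameter name); it must be smaller than len(grads).
    """
    ranked = sorted(
        grads,
        key=lambda g: descendant_score(loss, x, g, lr, rho),
        reverse=True,
    )
    kept = ranked[: len(grads) - b]
    return [sum(g[i] for g in kept) / len(kept) for i in range(len(x))]


def toy_loss(x):
    # Toy objective 0.5 * ||x||^2, whose true gradient at x is x itself.
    return 0.5 * sum(xi * xi for xi in x)


x = [1.0, -2.0]
grads = [
    [-10.0, 20.0],   # faulty: sign-flipped and scaled
    [100.0, 100.0],  # faulty: large arbitrary vector
    [1.0, -2.0],     # the single non-faulty worker
    [-1.0, 2.0],     # faulty: sign-flipped
]
update = zeno_aggregate(toy_loss, x, grads, lr=0.1, rho=0.01, b=3)
```

In this toy run three of the four workers are faulty, yet ranking keeps only the candidate that actually decreases the loss. A majority-based rule such as a coordinate-wise median would have no such guarantee here, since the faulty workers outnumber the honest one.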
