LG, CR, DC, ML | May 25, 2018

Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

arXiv:1805.10032v3 | 78 citations
Originality: Highly original
AI Analysis

This paper addresses fault tolerance in distributed machine learning and offers a more robust approach than prior methods, which required a majority of non-faulty nodes.

The paper tackles the problem of making distributed stochastic gradient descent tolerant to an arbitrary number of faulty workers. It requires only one non-faulty worker and proves convergence for non-convex problems.

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.
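The abstract does not spell out how workers are suspected or ranked. As a minimal sketch, assume each candidate gradient is scored by the estimated loss decrease it produces on a small validation batch, minus a penalty on its magnitude, and that the server averages only the highest-scored candidates. The names `zeno_aggregate`, `loss_fn`, `rho`, and `num_trim` are illustrative and do not come from the paper's code; the toy quadratic loss is likewise an assumption.

```python
import numpy as np

def zeno_aggregate(grads, loss_fn, x, lr, rho, num_trim):
    """Average the candidates that remain after dropping the `num_trim` lowest-scored ones.

    Illustrative sketch only: the score form and parameter names are assumptions.
    """
    grads = np.asarray(grads, dtype=float)
    base = loss_fn(x)
    # Score = estimated loss decrease from taking the step, minus a magnitude penalty.
    scores = np.array(
        [base - loss_fn(x - lr * g) - rho * np.dot(g, g) for g in grads]
    )
    # argsort is ascending, so skipping the first `num_trim` keeps the best candidates.
    keep = np.argsort(scores)[num_trim:]
    return grads[keep].mean(axis=0), scores

# Toy problem (assumed): f(x) = 0.5 * ||x||^2, whose true gradient is x.
x = np.array([1.0, -2.0])
loss = lambda v: 0.5 * float(np.dot(v, v))

honest = x.copy()      # one non-faulty worker
faulty = -10.0 * x     # faulty workers push in the wrong direction
candidates = [faulty, faulty, honest, faulty]

agg, scores = zeno_aggregate(candidates, loss, x, lr=0.1, rho=0.001, num_trim=3)
```

In this toy setup three of the four workers are faulty, yet trimming the three lowest-scored candidates leaves only the honest gradient. A majority-based rule such as a coordinate-wise median would instead follow the faulty workers here.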

Code Implementations: 1 repo