DCLGOct 17, 2018

Distributed Learning over Unreliable Networks

arXiv:1810.07766v466 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of designing robust distributed learning systems for scenarios with unreliable network conditions, though it is incremental as it builds on existing parameter server architectures.

The paper tackles the problem of distributed machine learning over unreliable networks, where messages may be dropped, and proves that algorithms can achieve convergence rates comparable to those over reliable networks, with the influence of packet drop rates diminishing as the number of parameter servers increases.

Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends, and consider the following question: {\em Can we design machine learning systems that are tolerant to network unreliability during training?} With this motivation, we focus on a theoretical problem of independent interest---given a standard distributed parameter server architecture, if every communication between the worker and the server has a non-zero probability $p$ of being dropped, does there exist an algorithm that still converges, and at what speed? The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over unreliable network can achieve comparable convergence rate to centralized or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes with the growth of the number of \textcolor{black}{parameter servers}. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulation to validate the system improvement by allowing the networks to be unreliable.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes