LG · DC · ML · Sep 22, 2020

Asynchronous Distributed Optimization with Stochastic Delays

arXiv:2009.10717v3 · 9 citations
Originality: Highly original
AI Analysis

This work addresses optimization challenges in distributed machine learning systems with data partitioned across machines, offering a more efficient asynchronous algorithm for such settings.

The paper tackles asynchronous finite-sum minimization in a distributed-data setting with stochastic delays. It develops ADSAGA, a SAGA-based algorithm that converges in $\tilde{O}\left(\left(n + \sqrt{m}κ\right)\log(1/ε)\right)$ iterations, where $n$ is the number of component functions, $m$ is the number of machines, and $κ$ is a condition number. This improves on existing methods for the same setting, which have only been shown to converge in $\tilde{O}(n^2κ\log(1/ε))$ iterations.
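To get a feel for the gap between the two iteration bounds quoted above, the following minimal sketch evaluates both expressions for one set of values. The function names and the sample values of $n$, $m$, $κ$, and $ε$ are hypothetical, chosen only for illustration. Because $\tilde{O}$ hides constants and logarithmic factors, the outputs indicate scaling only and are not predicted iteration counts.

```python
import math


def adsaga_bound(n, m, kappa, eps):
    """(n + sqrt(m) * kappa) * log(1/eps), with constants and hidden log factors dropped."""
    return (n + math.sqrt(m) * kappa) * math.log(1.0 / eps)


def prior_distributed_bound(n, kappa, eps):
    """n^2 * kappa * log(1/eps), with constants and hidden log factors dropped."""
    return n ** 2 * kappa * math.log(1.0 / eps)


if __name__ == "__main__":
    # Hypothetical problem sizes, not taken from the paper.
    n, m, kappa, eps = 10_000, 16, 100.0, 1e-6
    new = adsaga_bound(n, m, kappa, eps)
    old = prior_distributed_bound(n, kappa, eps)
    print(f"ADSAGA-style bound:        {new:.3e}")
    print(f"prior distributed bound:   {old:.3e}")
    print(f"ratio (prior / ADSAGA):    {old / new:.3e}")
```

The ADSAGA-style expression grows linearly in $n$ and only with $\sqrt{m}$ in the number of machines, whereas the earlier bound grows quadratically in $n$.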

We study asynchronous finite sum minimization in a distributed-data setting with a central parameter server. While asynchrony is well understood in parallel settings where the data is accessible by all machines (e.g., modifications of variance-reduced gradient algorithms like SAGA work well), little is known for the distributed-data setting. We develop an algorithm ADSAGA based on SAGA for the distributed-data setting, in which the data is partitioned between many machines. We show that with $m$ machines, under a natural stochastic delay model with a mean delay of $m$, ADSAGA converges in $\tilde{O}\left(\left(n + \sqrt{m}κ\right)\log(1/ε)\right)$ iterations, where $n$ is the number of component functions, and $κ$ is a condition number. This complexity sits squarely between the complexity $\tilde{O}\left(\left(n + κ\right)\log(1/ε)\right)$ of SAGA without delays and the complexity $\tilde{O}\left(\left(n + mκ\right)\log(1/ε)\right)$ of parallel asynchronous algorithms where the delays are arbitrary (but bounded by $O(m)$), and the data is accessible by all. Existing asynchronous algorithms for the distributed-data setting with arbitrary delays have only been shown to converge in $\tilde{O}(n^2κ\log(1/ε))$ iterations. We empirically compare, on least-squares problems, the iteration complexity and wallclock performance of ADSAGA to existing parallel and distributed algorithms, including synchronous minibatch algorithms. Our results demonstrate the wallclock advantage of variance-reduced asynchronous approaches over SGD or synchronous approaches.
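For orientation, here is a minimal sketch of plain SAGA, the delay-free, single-machine baseline that ADSAGA is based on, applied to a least-squares problem. It is not ADSAGA: it has no parameter server, no data partitioning, and no delays. The function name `saga_least_squares`, the objective $\frac{1}{n}\sum_i \frac{1}{2}(a_i^\top x - b_i)^2$, and the default step size $1/(3L)$ with $L = \max_i \|a_i\|^2$ are assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np


def saga_least_squares(A, b, num_iters, step_size=None, seed=0):
    """Plain SAGA on f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2.

    Illustrative baseline only: no delays, no distribution across machines.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    if step_size is None:
        # Assumed default: 1 / (3L), where L bounds each component's smoothness.
        L = np.max(np.sum(A * A, axis=1))
        step_size = 1.0 / (3.0 * L)

    x = np.zeros(d)
    # Table of the most recently computed gradient of every component function.
    grad_table = A * (A @ x - b)[:, None]
    grad_mean = grad_table.mean(axis=0)

    for _ in range(num_iters):
        j = rng.integers(n)                    # sample one component uniformly
        g_new = A[j] * (A[j] @ x - b[j])       # fresh gradient of component j
        # Variance-reduced step: fresh gradient, minus stale one, plus table mean.
        x = x - step_size * (g_new - grad_table[j] + grad_mean)
        # Keep the running mean consistent with the table update.
        grad_mean = grad_mean + (g_new - grad_table[j]) / n
        grad_table[j] = g_new
    return x
```

The gradient table is what makes the update variance-reduced: as the iterate approaches the minimizer, the correction term shrinks, so a constant step size can be used. In the distributed-data setting the abstract describes, each component's data lives on one machine, so a scheme of this kind has to cope with gradients that arrive at the server after a delay.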
