LG, DC, OC · Dec 31, 2020

CADA: Communication-Adaptive Distributed Adam

arXiv:2012.15469v1 · 25 citations
AI Analysis

This work addresses the problem of high communication costs in distributed machine learning for practitioners using adaptive optimizers like Adam.

This paper introduces CADA, a communication-adaptive variant of the Adam optimizer for distributed machine learning. CADA adaptively reuses stale Adam gradients to reduce communication uploads, achieving comparable convergence rates to the original Adam while significantly reducing total communication rounds in experiments.

Stochastic gradient descent (SGD) has become the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method, hence the name CADA. The key components of CADA are a set of new rules, tailored to adaptive stochastic gradients, that can be implemented to save communication uploads. The new algorithms adaptively reuse stale Adam gradients, thus saving communication, and still have convergence rates comparable to those of the original Adam. In numerical experiments, CADA achieves impressive empirical performance in terms of reducing the total number of communication rounds.
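The idea of reusing stale gradients to skip uploads can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: the abstract does not state CADA's actual skipping rules, so the condition used here (upload only when the fresh gradient differs from the last uploaded one by more than a fixed threshold) is a simplified stand-in chosen for illustration. The function name `run_lazy_adam`, the quadratic objective, and all parameter values are likewise assumptions.

```python
import random


def run_lazy_adam(num_workers=4, steps=200, threshold=0.0, lr=0.05,
                  beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Toy server/worker loop: Adam-style updates on possibly stale gradients.

    Each worker w holds f_w(x) = 0.5 * (x - c_w)^2, so the minimiser of the
    average objective is the mean of the centers c_w.
    Returns (final x, total uploads, minimiser of the average objective).
    """
    rng = random.Random(seed)
    centers = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    target = sum(centers) / num_workers

    x = 5.0                        # shared model parameter (scalar for simplicity)
    stale = [0.0] * num_workers    # last gradient each worker uploaded
    m = 0.0                        # Adam first-moment estimate
    v = 0.0                        # Adam second-moment estimate
    uploads = 0

    for t in range(1, steps + 1):
        for w in range(num_workers):
            # Noisy local gradient of f_w at the current x.
            fresh = (x - centers[w]) + rng.gauss(0.0, 0.01)
            # ASSUMED skip rule (not the paper's condition): upload only if
            # the gradient changed enough since this worker's last upload.
            if t == 1 or (fresh - stale[w]) ** 2 >= threshold:
                stale[w] = fresh
                uploads += 1
            # Otherwise the server keeps reusing stale[w]; nothing is sent.

        # Server aggregates the (possibly stale) gradients.
        g = sum(stale) / num_workers

        # Standard Adam update with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)

    return x, uploads, target


if __name__ == "__main__":
    for thr in (0.0, 0.01):
        x, uploads, target = run_lazy_adam(threshold=thr)
        print(f"threshold={thr}: uploads={uploads}, |x - x*|={abs(x - target):.4f}")
```

With `threshold=0.0` every worker uploads at every step, which corresponds to ordinary distributed Adam in this toy setting; raising the threshold trades gradient freshness for fewer uploads. How CADA sets and adapts its own conditions is described in the paper itself.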

Code Implementations: 1 repo