LG · OC · Sep 20, 2024

Convergence of Distributed Adaptive Optimization with Local Updates

arXiv:2409.13155v2 · 6 citations · h-index: 8
AI Analysis

This work addresses a gap in theoretical understanding for distributed machine learning practitioners, though it appears incremental as it builds on existing adaptive methods.

The paper studies the theoretical benefits of local updates in distributed adaptive optimization. It proves, for the first time, that Local SGD with momentum and Local Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively, in certain regimes.

We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, are not yet fully understood. In this paper, for the first time, we prove that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively, in certain regimes. Our analysis relies on a novel technique to prove contraction during local iterations, a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.
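To make the setting concrete, here is a minimal single-process sketch of local updates with momentum and gradient clipping. It is an illustration under stated assumptions, not the paper's exact algorithm: the function names (`clip`, `average`, `local_sgdm`), the toy quadratic objective, the Gaussian gradient noise, all hyperparameter values, and the choice to average both iterates and momentum buffers at each communication round are hypothetical.

```python
import random

def clip(g, rho):
    """Rescale g so its Euclidean norm is at most rho (global-norm clipping)."""
    norm = sum(v * v for v in g) ** 0.5
    scale = min(1.0, rho / norm) if norm > 0 else 1.0
    return [scale * v for v in g]

def average(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_sgdm(grad, x0, workers=4, rounds=20, local_steps=8,
               lr=0.1, beta=0.9, rho=1.0, noise=0.1, seed=0):
    """Simulate Local SGD with momentum: each worker takes `local_steps`
    clipped momentum steps on its own copy, then all workers average
    (assumption: both iterates and momentum buffers are averaged)."""
    rng = random.Random(seed)
    xs = [list(x0) for _ in range(workers)]
    ms = [[0.0] * len(x0) for _ in range(workers)]
    for _ in range(rounds):
        for w in range(workers):
            for _ in range(local_steps):
                # Stochastic gradient: true gradient plus Gaussian noise.
                g = [gi + rng.gauss(0.0, noise) for gi in grad(xs[w])]
                g = clip(g, rho)
                # Exponential-moving-average momentum, then a local step.
                ms[w] = [beta * m + (1 - beta) * gi for m, gi in zip(ms[w], g)]
                xs[w] = [x - lr * m for x, m in zip(xs[w], ms[w])]
        # Communication round: synchronize all workers to the average.
        x_avg, m_avg = average(xs), average(ms)
        xs = [list(x_avg) for _ in range(workers)]
        ms = [list(m_avg) for _ in range(workers)]
    return xs[0]

def quad_grad(x):
    """Gradient of the toy objective 0.5 * ||x||^2."""
    return list(x)

x_final = local_sgdm(quad_grad, [5.0, -3.0])
```

In this sketch, communication happens only once per `local_steps` gradient evaluations, which is the intermittent-communication pattern the abstract refers to. A minibatch baseline would instead average the workers' gradients at every step.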

