LG OC Sep 20, 2024

Convergence of Distributed Adaptive Optimization with Local Updates

arXiv:2409.13155v2, 12.56 citations, h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a gap in theoretical understanding for distributed machine learning practitioners, though it appears incremental as it builds on existing adaptive methods.

The paper tackles the problem of understanding the theoretical benefits of local updates in distributed adaptive optimization, proving for the first time that Local SGD with momentum and Local Adam can outperform minibatch versions in convex and weakly convex settings under certain regimes.

We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, are not yet fully understood. In this paper, we prove for the first time that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in certain regimes, in convex and weakly convex settings respectively. Our analysis relies on a novel technique for proving contraction during local iterations, a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.
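To make the setting concrete, here is a minimal sketch of local updates with momentum and gradient clipping on a toy problem. It is an illustration only, not the paper's algorithm or analysis: the function names (`clip`, `local_sgdm`, `noisy_quadratic_grad`), the hyperparameter values, the noisy quadratic objective, and the choice to average both iterates and momentum buffers at each communication round are all assumptions made for this example.

```python
import random


def clip(g, rho):
    """Rescale vector g so that its Euclidean norm is at most rho."""
    norm = sum(x * x for x in g) ** 0.5
    if norm <= rho or norm == 0.0:
        return list(g)
    return [x * rho / norm for x in g]


def local_sgdm(grad_fn, x0, workers=4, rounds=50, local_steps=8,
               lr=0.05, beta=0.9, rho=1.0, seed=0):
    """Toy Local SGD with momentum: each worker takes `local_steps`
    clipped momentum steps, then all workers average (one communication)."""
    rng = random.Random(seed)
    x_global = list(x0)
    m_global = [0.0] * len(x0)
    for _ in range(rounds):
        xs, ms = [], []
        for _ in range(workers):
            # Every worker starts the round from the shared state.
            x, m = list(x_global), list(m_global)
            for _ in range(local_steps):
                g = clip(grad_fn(x, rng), rho)  # clipped stochastic gradient
                m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
                x = [xi - lr * mi for xi, mi in zip(x, m)]
            xs.append(x)
            ms.append(m)
        # Communication: average iterates and momentum (assumed here).
        x_global = [sum(col) / workers for col in zip(*xs)]
        m_global = [sum(col) / workers for col in zip(*ms)]
    return x_global


def noisy_quadratic_grad(x, rng, noise=0.1):
    """Stochastic gradient of f(x) = 0.5 * ||x||^2 with Gaussian noise."""
    return [xi + rng.gauss(0.0, noise) for xi in x]


x_final = local_sgdm(noisy_quadratic_grad, [5.0, -3.0])
```

With `rounds=50` and `local_steps=8`, each worker takes 400 gradient steps but communicates only 50 times. A minibatch counterpart with the same communication budget would take 50 steps, each using a larger batch; that is the kind of trade-off the abstract refers to.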
