Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms
This work targets the speed and scalability of distributed SGD in machine learning. It offers a foundational framework for analyzing and designing communication-efficient algorithms, though it builds incrementally on existing communication-reduction strategies.
The paper addresses the lack of rigorous convergence analysis for communication-efficient SGD algorithms. It introduces a unified framework, Cooperative SGD, which subsumes existing methods and provides novel convergence guarantees for them. The framework also enables the design of new algorithms that balance reduced communication against fast error convergence and a low error floor.
Communication-efficient SGD algorithms, which allow nodes to perform local updates and periodically synchronize local models, are highly effective in improving the speed and scalability of distributed SGD. However, a rigorous convergence analysis and comparative study of different communication-reduction strategies remains a largely open problem. This paper presents a unified framework called Cooperative SGD that subsumes existing communication-efficient SGD algorithms such as periodic-averaging, elastic-averaging and decentralized SGD. By analyzing Cooperative SGD, we provide novel convergence guarantees for existing algorithms. Moreover, this framework enables us to design new communication-efficient SGD algorithms that strike the best balance between reducing communication overhead and achieving fast error convergence with low error floor.
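To make the "local updates, then periodic synchronization" pattern concrete, here is a minimal sketch of periodic-averaging SGD, one of the algorithms the abstract says the framework subsumes. It is an illustration only, not the paper's framework or analysis: the function name, the toy objective f(x) = 0.5 * (x - 3)^2, the Gaussian gradient noise, and all parameter values are assumptions chosen for the example.

```python
import random


def periodic_averaging_sgd(num_workers=4, tau=5, num_rounds=40,
                           lr=0.1, noise_std=0.5, seed=0):
    """Toy periodic-averaging SGD on f(x) = 0.5 * (x - 3)^2.

    The objective and the noise model are hypothetical; they only
    stand in for a real loss and real minibatch gradients.
    """
    rng = random.Random(seed)
    optimum = 3.0
    # Every worker starts from the same scalar model.
    models = [0.0] * num_workers
    history = []

    for _ in range(num_rounds):
        # Local phase: each worker takes tau SGD steps with no communication.
        for _ in range(tau):
            for i in range(num_workers):
                noisy_grad = (models[i] - optimum) + rng.gauss(0.0, noise_std)
                models[i] -= lr * noisy_grad

        # Synchronization phase: average the local models and broadcast
        # the result, so communication happens once per tau local steps.
        averaged = sum(models) / num_workers
        models = [averaged] * num_workers
        history.append(averaged)

    return history


history = periodic_averaging_sgd()
print(f"averaged model after {len(history)} rounds: {history[-1]:.3f}")
```

In this sketch, `tau` is the knob that trades communication for accuracy: `tau=1` reduces to fully synchronous SGD with averaging after every step, while a larger `tau` communicates less often and lets the local models drift apart between synchronizations.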