Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
This work improves the efficiency of large-scale machine learning training by reducing communication overhead. The contribution is incremental, since it builds on existing local-update SGD methods.
The paper tackles system variability in distributed SGD (node straggling and random communication delays) by designing AdaComm, an adaptive communication strategy that adjusts the model-averaging frequency during training to optimize the error-runtime trade-off. Experiments on deep neural networks show that AdaComm can reach the same final training loss as fully synchronous SGD in 3 times less wall-clock time.
Large-scale machine learning training, in particular distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations), and analyze how it is affected by the frequency of averaging. The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that AdaComm can take $3 \times$ less time than fully synchronous SGD, and still reach the same final training loss.
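The strategy described above (local updates, periodic averaging, and an averaging period that shrinks as training progresses) can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the toy objective f(x) = 0.5 x^2, the Gaussian gradient noise, the function name `adaptive_local_sgd`, its parameters, and in particular the rule that scales the period by the square root of the loss ratio. The text above does not state the exact update rule, so this should not be read as the authors' implementation.

```python
import math
import random


def adaptive_local_sgd(num_workers=4, num_blocks=8, block_len=50,
                       tau0=16, lr=0.01, noise=0.5, x0=10.0, seed=0):
    """Local-update SGD on f(x) = 0.5 * x**2 with a shrinking averaging period.

    Returns (taus, losses, comm_rounds): the period used in each block, the
    loss of the averaged model at the end of each block, and the total number
    of averaging (communication) rounds.
    """
    rng = random.Random(seed)
    models = [x0] * num_workers          # every worker starts from the same point
    loss0 = 0.5 * x0 ** 2
    tau = tau0                           # start with infrequent averaging
    taus, losses, comm_rounds = [], [], 0

    for _ in range(num_blocks):
        taus.append(tau)
        for step in range(1, block_len + 1):
            # Local step: each worker follows its own noisy gradient of f.
            models = [x - lr * (x + rng.gauss(0.0, noise)) for x in models]
            # Average every tau local steps, and always at the end of a block.
            if step % tau == 0 or step == block_len:
                avg = sum(models) / num_workers
                models = [avg] * num_workers
                comm_rounds += 1
        loss = 0.5 * models[0] ** 2      # all workers hold the averaged model here
        losses.append(loss)
        # Assumed rule (hypothetical): scale tau0 by the square root of the
        # loss ratio, never going below 1 or above the previous period.
        proposed = math.ceil(math.sqrt(loss / loss0) * tau0)
        tau = max(1, min(tau, proposed))

    return taus, losses, comm_rounds
```

Calling `taus, losses, comm_rounds = adaptive_local_sgd()` returns the period used in each block alongside the loss. Fixing `tau0=1` recovers fully synchronous SGD in this toy setting, with one averaging round per step, which gives a baseline for comparing the number of communication rounds.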