Accelerating Distributed ML Training via Selective Synchronization
This work addresses the scalability bottleneck in distributed machine learning training and offers a practical way to accelerate training without sacrificing accuracy, though the contribution is incremental because it builds on existing semi-synchronous approaches.
The paper tackles the high communication cost of distributed deep neural network training by introducing SelSync, a method that dynamically chooses when to synchronize updates based on their significance. It reduces training time by up to 14x while matching or improving on the accuracy of bulk-synchronous parallel training.
In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently, and in bulk-synchronous parallel (BSP) training the workers aggregate their local updates on every step. However, BSP does not scale out linearly due to the high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses at each step whether to incur communication by calling the aggregation op or to avoid it by applying local updates, based on the significance of those updates. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
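The per-step choice between aggregating and updating locally can be illustrated with a small sketch. The abstract does not define how significance is measured, so everything here is an assumption made for illustration: the metric (update norm relative to parameter norm), the `threshold` value, the rule that a single significant worker triggers synchronization for all, and the function names `significance` and `selective_step`. Averaging stands in for the aggregation op; this is not the authors' implementation.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def significance(update, params, eps=1e-12):
    """Hypothetical metric: update size relative to parameter size."""
    return l2_norm(update) / (l2_norm(params) + eps)


def average(vectors):
    """Element-wise mean; stands in for an all-reduce across workers."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def selective_step(worker_params, worker_grads, lr=0.1, threshold=0.05):
    """Run one step and return (new_params_per_worker, synced).

    Assumption: all workers synchronize if any worker's update is
    significant; otherwise each worker applies only its own update.
    """
    updates = [[lr * g for g in grads] for grads in worker_grads]
    must_sync = any(
        significance(u, p) >= threshold
        for u, p in zip(updates, worker_params)
    )
    if must_sync:
        mean_update = average(updates)
        applied = [mean_update] * len(worker_params)
    else:
        applied = updates  # no communication on this step
    new_params = [
        [p - u for p, u in zip(params, upd)]
        for params, upd in zip(worker_params, applied)
    ]
    return new_params, must_sync


if __name__ == "__main__":
    params = [[1.0, 1.0], [1.0, 1.0]]
    # Large gradients: the step is treated as significant and aggregated.
    print(selective_step(params, [[1.0, 0.0], [0.0, 1.0]]))
    # Small gradients: each worker applies its own update locally.
    print(selective_step(params, [[0.1, 0.0], [0.0, 0.1]]))
```

With large gradients both workers end the step with identical parameters, as in BSP; with small gradients the workers drift apart slightly and no communication cost is paid. A real system would also need a policy for reconciling diverged replicas, which this sketch leaves out.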