LG · DC · Feb 24, 2023

DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining

Lin Zhang, Shaohuai Shi, Xiaowen Chu, Wei Wang, Bo Li, Chengjian Liu

arXiv:2302.12445v25.321 citations · h-index: 70 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses communication bottlenecks in distributed deep learning, offering incremental improvements for faster training on GPU clusters.

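The paper targets two weaknesses of communication scheduling in distributed deep learning: excessive startup latency for each all-reduce operation, and sub-optimal overlap caused by the synchronization required before the next iteration's feed-forward pass. It proposes DeAR, a scheduling algorithm that decouples each all-reduce into two operations so that communication overlaps with both backpropagation and feed-forward computation. On a 64-GPU cluster, DeAR achieves up to 83% and 15% training speedup over state-of-the-art solutions with 10Gb/s Ethernet and 100Gb/s InfiniBand interconnects, respectively.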

Communication scheduling has been shown to be effective in accelerating distributed training, which enables all-reduce communications to be overlapped with backpropagation computations. This has been commonly adopted in popular distributed deep learning frameworks. However, there exist two fundamental problems: (1) excessive startup latency proportional to the number of workers for each all-reduce operation; (2) it only achieves sub-optimal training performance due to the dependency and synchronization requirement of the feed-forward computation in the next iteration. We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations, which overlaps with both backpropagation and feed-forward computations without extra communications. We further design a practical tensor fusion algorithm to improve the training performance. Experimental results with five popular models show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions on a 64-GPU cluster with 10Gb/s Ethernet and 100Gb/s InfiniBand interconnects, respectively.
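The decoupling idea can be illustrated with a small simulation. The abstract does not name the two operations, so this sketch assumes they are reduce-scatter and all-gather, the standard two-phase decomposition of all-reduce. Workers are simulated as in-process lists rather than through a real communication library, and all function names (`all_reduce`, `chunk_bounds`, `reduce_scatter`, `all_gather`) are hypothetical, chosen only for illustration.

```python
def all_reduce(worker_grads):
    """Reference result: every worker ends with the element-wise sum."""
    total = [sum(vals) for vals in zip(*worker_grads)]
    return [list(total) for _ in worker_grads]


def chunk_bounds(length, parts):
    """Split range(length) into `parts` contiguous, near-equal slices."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def reduce_scatter(worker_grads):
    """Phase 1 (assumed): worker i ends with only the summed i-th chunk."""
    n_workers = len(worker_grads)
    bounds = chunk_bounds(len(worker_grads[0]), n_workers)
    return [
        [sum(g[j] for g in worker_grads) for j in range(lo, hi)]
        for lo, hi in bounds
    ]


def all_gather(chunks):
    """Phase 2 (assumed): every worker collects all reduced chunks."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


# Three simulated workers, each holding a local gradient of five elements.
grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [100.0, 200.0, 300.0, 400.0, 500.0],
]

partial = reduce_scatter(grads)   # could overlap with backpropagation
result = all_gather(partial)      # could overlap with the next feed-forward

# The two phases together reproduce a plain all-reduce.
assert result == all_reduce(grads)
```

Because the two phases compose to the same result as a single all-reduce, a scheduler is free to run the first phase during backpropagation and defer the second to the next iteration's feed-forward pass. The scheduling itself and the tensor fusion step are not shown here.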
