DC · AR · LG · NI · Oct 9, 2021

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

arXiv:2110.04478v3 · 53 citations
AI Analysis

This paper addresses a critical bottleneck in scaling distributed training for deep learning models, offering incremental improvements in communication efficiency for systems with heterogeneous network bandwidths.

The paper tackles inefficient network bandwidth utilization in distributed deep learning training on heterogeneous multi-dimensional networks. It proposes Themis, a dynamic collective scheduling scheme that improves the network bandwidth utilization of a single All-Reduce by 1.72x on average and boosts end-to-end training iteration performance for workloads like ResNet-152 by up to 2.25x.

Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activations, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: if we rely on the collective-communication scheduling techniques used on today's systems, it is difficult to keep all network dimensions busy and maximize the network BW in such a hybrid environment. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72X (2.70X max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.
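
To make the idea of balancing chunk traffic across dimensions concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes a simplified cost model in which only the reduce-scatter phase of a hierarchical All-Reduce is counted, each dimension runs a ring reduce-scatter, and the data a chunk carries shrinks by the dimension size after each stage. The names `stage_times`, `fixed_order_load`, and `greedy_schedule`, and the greedy "minimize the busiest dimension" rule, are illustrative assumptions.

```python
from itertools import permutations


def stage_times(chunk_bytes, order, dims):
    """Time one chunk spends on each dimension when visited in `order`.

    dims maps a dimension name to (npus_in_dimension, bandwidth).
    Assumed model: a ring reduce-scatter moves (n - 1) / n of the current
    data, after which the per-NPU data shrinks by a factor of n.
    """
    times = {}
    size = float(chunk_bytes)
    for d in order:
        npus, bw = dims[d]
        times[d] = size * (npus - 1) / npus / bw
        size /= npus
    return times


def fixed_order_load(chunks, dims):
    """Baseline: every chunk traverses the dimensions in the same order."""
    load = {d: 0.0 for d in dims}
    order = tuple(dims)
    for chunk_bytes in chunks:
        for d, t in stage_times(chunk_bytes, order, dims).items():
            load[d] += t
    return load


def greedy_schedule(chunks, dims):
    """Pick, per chunk, the dimension order that keeps the busiest
    dimension as lightly loaded as possible (hypothetical greedy rule)."""
    load = {d: 0.0 for d in dims}
    plan = []
    for chunk_bytes in chunks:
        best_order, best_times, best_peak = None, None, None
        for order in permutations(dims):
            times = stage_times(chunk_bytes, order, dims)
            peak = max(load[d] + times[d] for d in dims)
            if best_peak is None or peak < best_peak:
                best_order, best_times, best_peak = order, times, peak
        for d, t in best_times.items():
            load[d] += t
        plan.append(best_order)
    return plan, load
```

For example, with two dimensions of four NPUs each and equal bandwidth (`dims = {"dim0": (4, 100.0), "dim1": (4, 100.0)}`) and four chunks of 1600 units, the fixed order accumulates a load of 48 on `dim0` and 12 on `dim1`, while the greedy rule alternates the traversal order and ends with 30 on each. Under this toy model the imbalance appears even with equal bandwidths, because the first dimension visited always carries the largest data volume. Heterogeneous bandwidths would change the numbers but not the mechanism.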

