DC LG · Mar 15, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda

arXiv:2303.08374v14.38 citations · h-index: 60

Originality: Incremental advance

AI Analysis

This work addresses communication-efficiency problems for researchers and engineers training large-scale deep learning models with advanced parallelism. It is an incremental advance: it optimizes how existing communication backends are used rather than introducing a new paradigm.

The paper tackles the challenge of varied communication operations in distributed deep learning by proposing MCR-DL, an extensible framework that dynamically mixes communication backends, resulting in throughput improvements such as 31% for DeepSpeed-MoE on 256 GPUs and 20-25% for other models on 32 GPUs.

In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. Examples of models using advanced parallelism strategies include Deep Learning Recommendation Models (DLRM) and Mixture-of-Experts (MoE). Communication libraries' performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs with the Theta-GPU HPC system.
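The idea of picking a communication backend per operation and message size can be illustrated with a minimal sketch. This is not MCR-DL's API: `TUNING_TABLE`, `DEFAULT_BACKEND`, and `select_backend` are hypothetical names, and the size thresholds and backend choices are invented for illustration rather than taken from the paper's measurements. The sketch assumes a tuning run has already produced, for each operation, a list of message-size ranges and the backend that performed best in each range.

```python
from bisect import bisect_left

# Hypothetical tuning table (illustrative values, not measured results).
# For each operation: a list of (inclusive upper bound in bytes, backend),
# sorted by bound, with a final catch-all entry at infinity.
TUNING_TABLE = {
    "all_reduce": [(64 * 1024, "mpi"), (float("inf"), "nccl")],
    "all_to_all": [(1024 * 1024, "nccl"), (float("inf"), "mpi")],
}

# Assumed fallback for operations that were never tuned.
DEFAULT_BACKEND = "nccl"


def select_backend(op: str, nbytes: int) -> str:
    """Return the backend name recorded for this operation and message size."""
    entries = TUNING_TABLE.get(op)
    if not entries:
        return DEFAULT_BACKEND
    bounds = [bound for bound, _ in entries]
    # bisect_left finds the first range whose upper bound is >= nbytes.
    idx = bisect_left(bounds, nbytes)
    return entries[idx][1]


if __name__ == "__main__":
    for op, size in [("all_reduce", 1024), ("all_reduce", 8 * 1024 * 1024),
                     ("all_to_all", 4 * 1024 * 1024), ("broadcast", 10)]:
        print(op, size, select_backend(op, size))
```

A real runtime would also have to issue the operation on the chosen backend and synchronize between backends to avoid deadlocks, which the abstract names as a core concern; this sketch covers only the lookup step.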
