LG DC Nov 10, 2022

On Optimizing the Communication of Model Parallelism

Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

Berkeley

arXiv:2211.05322v2, 17.748 citations, h-index: 67

Originality: Incremental advance

AI Analysis

This work addresses a critical bottleneck in scaling large models on clusters, though it is incremental in that it optimizes an existing communication pattern.

The paper tackles the communication inefficiency of cross-mesh resharding in model-parallel deep learning. It proposes a system that outperforms existing ones by up to 10x on microbenchmarks and improves end-to-end training throughput by 10% on GPT-3 and 50% on U-Transformer.

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
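As a rough illustration of the many-to-many multicast formulation described in the abstract, the Python sketch below enumerates, for a 1-D tensor, which slices must travel from a source mesh to a destination mesh and which devices hold or need each slice. The helper names `shard_1d` and `resharding_tasks`, the device labels, and the even 1-D sharding scheme are assumptions made for illustration only. This is not the paper's system, and it performs no actual communication.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open index range [start, stop)


def shard_1d(length: int, devices: List[str], replicas: int = 1) -> Dict[str, Interval]:
    """Hypothetical layout: split [0, length) evenly into shards, with each
    shard replicated on `replicas` consecutive devices."""
    if len(devices) % replicas != 0:
        raise ValueError("device count must be a multiple of replicas")
    num_shards = len(devices) // replicas
    if length % num_shards != 0:
        raise ValueError("length must divide evenly into shards")
    size = length // num_shards
    return {
        dev: ((i // replicas) * size, (i // replicas + 1) * size)
        for i, dev in enumerate(devices)
    }


def resharding_tasks(
    src: Dict[str, Interval], dst: Dict[str, Interval]
) -> Dict[Interval, Dict[str, Set[str]]]:
    """For every slice shared by a source and a destination device, record the
    source devices that hold it and the destination devices that need it."""
    tasks: Dict[Interval, Dict[str, Set[str]]] = {}
    for d_dev, (d_lo, d_hi) in dst.items():
        for s_dev, (s_lo, s_hi) in src.items():
            lo, hi = max(d_lo, s_lo), min(d_hi, s_hi)
            if lo < hi:  # non-empty overlap: this slice must cross meshes
                entry = tasks.setdefault((lo, hi), {"senders": set(), "receivers": set()})
                entry["senders"].add(s_dev)
                entry["receivers"].add(d_dev)
    return tasks


if __name__ == "__main__":
    # Source mesh: 2 devices, tensor of 8 elements split in half.
    src_layout = shard_1d(8, ["s0", "s1"])
    # Destination mesh: 4 devices, same split, each half replicated twice.
    dst_layout = shard_1d(8, ["d0", "d1", "d2", "d3"], replicas=2)
    for (lo, hi), entry in sorted(resharding_tasks(src_layout, dst_layout).items()):
        print(f"slice [{lo}, {hi}): {sorted(entry['senders'])} -> {sorted(entry['receivers'])}")
```

In this toy layout each slice has one holder and two receivers, so every task is a small multicast. With several senders and several receivers per slice, choosing who sends to whom becomes the scheduling question the abstract refers to.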
