LG, DC | Nov 10, 2022

On Optimizing the Communication of Model Parallelism

Berkeley
arXiv:2211.05322v2 | 48 citations | h-index: 67
Originality: Incremental advance
AI Analysis

This work addresses a critical bottleneck in scaling large models on clusters, though the advance is incremental: it optimizes an existing communication pattern.

The paper tackles the communication inefficiency of cross-mesh resharding in model-parallel deep learning. Its proposed system outperforms existing ones by up to 10x on microbenchmarks and improves end-to-end training throughput by 10% on GPT-3 and 50% on U-Transformer.

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
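To make the cross-mesh resharding pattern described above concrete, here is a minimal sketch, not the paper's system. It assumes a 1-D tensor, contiguous shards, and no replication on the source mesh. The names `even_shards` and `resharding_plan` are hypothetical helpers introduced only for illustration. The sketch computes which slice each source device would have to send to each destination device, which shows why the problem is a many-to-many multicast: when the destination layout is replicated, one source slice has several receivers.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, stop) range of tensor indices


def even_shards(length: int, num_devices: int) -> List[Interval]:
    """Split [0, length) into num_devices contiguous, equal shards (hypothetical helper)."""
    if length % num_devices != 0:
        raise ValueError("length must divide evenly across devices")
    size = length // num_devices
    return [(i * size, (i + 1) * size) for i in range(num_devices)]


def resharding_plan(
    src: List[Interval], dst: List[Interval]
) -> Dict[int, List[Tuple[int, Interval]]]:
    """For each source device, list the (destination device, slice) pairs it must send.

    Assumption: source shards do not overlap, so every slice has exactly one sender.
    """
    plan: Dict[int, List[Tuple[int, Interval]]] = {s: [] for s in range(len(src))}
    for s, (s_lo, s_hi) in enumerate(src):
        for d, (d_lo, d_hi) in enumerate(dst):
            # The slice to transfer is the overlap of the two index ranges.
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:
                plan[s].append((d, (lo, hi)))
    return plan


# Source mesh: 2 devices holding an 8-element tensor, sharded evenly.
src_layout = even_shards(8, 2)

# Destination mesh A: 4 devices, sharded 2 ways and replicated 2 ways.
replicated_dst = even_shards(8, 2) * 2
multicast_plan = resharding_plan(src_layout, replicated_dst)

# Destination mesh B: 4 devices, sharded 4 ways with no replication.
finer_dst = even_shards(8, 4)
scatter_plan = resharding_plan(src_layout, finer_dst)

for name, plan in [("replicated", multicast_plan), ("finer", scatter_plan)]:
    for sender, sends in plan.items():
        print(name, "source", sender, "->", sends)
```

In the replicated case each source device sends the same slice to two destination devices, which is the situation a broadcast-based scheme targets. In the finer-sharded case each source device sends two distinct slices instead.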

