DC AI CL LG
Nov 11, 2022

Breadth-First Pipeline Parallelism

arXiv:2211.05953v2
5 citations
h-index: 6

Originality: Highly original

AI Analysis

This paper addresses the training inefficiency of large-scale AI models, offering a novel training schedule that optimizes the combination of pipeline and data parallelism.

The paper tackles the problem of inefficient training of large models by introducing Breadth-First Pipeline Parallelism, which increased training throughput by up to 43% for a 52 billion-parameter model compared to Megatron-LM.

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
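
The relationship between a throughput gain and the resulting training time can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a fixed amount of training work, so that wall-clock time is inversely proportional to throughput, and the function name `time_reduction` is hypothetical.

```python
def time_reduction(throughput_gain: float) -> float:
    """Fractional reduction in wall-clock time for a fixed amount of work.

    Assumes time is inversely proportional to throughput, so a relative
    throughput gain g gives a time ratio of 1 / (1 + g).
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    # 0.43 is the "up to 43%" throughput increase quoted in the abstract.
    gain = 0.43
    print(f"throughput gain: {gain:.0%}")
    print(f"time reduction under the fixed-work assumption: {time_reduction(gain):.1%}")
```

Under this assumption the time saved is smaller than the throughput gain itself, since the same work is divided by a larger rate rather than reduced by a matching percentage. Cost follows the same ratio if it scales with GPU-hours.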
