DC · LG · Jul 14, 2021

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

arXiv:2107.06925v5 · 182 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in large-scale neural network training for AI researchers and practitioners, representing an incremental improvement over existing pipeline parallelism methods.

The paper tackles the challenge of training large deep learning models at scale by proposing Chimera, a novel bidirectional pipeline parallelism scheme that reduces bubbles by up to 50% and improves training throughput by 1.16x-2.34x over state-of-the-art methods for a GPT-2 model with 1.3 billion parameters.

Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera also has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
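To give a feel for what a 50% bubble reduction could mean, here is a minimal back-of-the-envelope sketch. It is not Chimera's scheduler and does not reproduce the measured 1.16x-2.34x figures. It assumes an idealized GPipe-style synchronous pipeline in which every stage takes the same time per micro-batch, so each worker idles for (stages - 1) slots per pass, and it then simply halves that idle time to model the best case quoted above. The function names and the `bubble_scale` parameter are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def bubble_fraction(stages: int, micro_batches: int,
                    bubble_scale: Fraction = Fraction(1)) -> Fraction:
    """Idle share of one pass in an idealized synchronous pipeline.

    Assumption (not from the paper summary): all stages take equal time,
    so a GPipe-style schedule has (stages - 1) idle slots per worker for
    every `micro_batches` busy slots. `bubble_scale` is a hypothetical
    knob that shrinks the idle slots, e.g. 1/2 for "50% fewer bubbles".
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    idle = bubble_scale * (stages - 1)
    return idle / (idle + micro_batches)


def ideal_speedup(stages: int, micro_batches: int,
                  bubble_scale: Fraction) -> Fraction:
    """Throughput ratio implied by shrinking bubbles, under the same assumptions."""
    busy_before = 1 - bubble_fraction(stages, micro_batches)
    busy_after = 1 - bubble_fraction(stages, micro_batches, bubble_scale)
    return busy_after / busy_before


if __name__ == "__main__":
    half = Fraction(1, 2)
    for d in (4, 8, 16, 32):
        base = bubble_fraction(d, d)
        halved = bubble_fraction(d, d, half)
        gain = ideal_speedup(d, d, half)
        print(f"stages={d:2d}  bubbles {float(base):.3f} -> {float(halved):.3f}  "
              f"ideal speedup {float(gain):.2f}x")
```

Under these assumptions, halving the bubbles matters most when the number of micro-batches is small relative to the pipeline depth, which is the regime where idle slots dominate an iteration.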

Code Implementations: 1 repo