Pipeline Parallelism with Controllable Memory
This work addresses memory inefficiency in pipeline parallelism for distributed deep learning training, offering a systematic method to reduce memory usage without sacrificing performance, which is incremental but impactful for scaling large models.
The paper tackles the problem of inefficient memory usage in pipeline parallelism schedules by introducing a framework that decomposes schedules into building blocks, enabling controllable activation memory. The result is a family of memory-efficient building blocks that reduce peak activation memory to as low as 1/3 of the 1F1B baseline while maintaining or improving throughput, with evaluations showing throughput improvements of 7% to 55% in pure pipeline settings and 16% for large language models.
Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the building block decides the peak activation memory of the pipeline schedule. Guided by the observations, we find that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. To address this, we introduce a family of memory efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by from 7% to 55% in terms of throughput. When employing a grid search over hybrid parallelism hyperparameters in practical scenarios, our methods demonstrate a 16% throughput improvement over the 1F1B baseline for large language models. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.