DC | LG | Apr 21, 2020

torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models

arXiv:2004.09910v1 | 60 citations | Has Code
AI Analysis

torchgpipe gives researchers and practitioners a ready-to-use way to scale model training in PyTorch, though the contribution is incremental because it builds on existing GPipe concepts.

The authors tackled the problem of training giant models by developing torchgpipe, a PyTorch library for micro-batch pipeline parallelism with checkpointing, demonstrating its efficiency on architectures like AmoebaNet-D and U-Net.

We design and implement a ready-to-use library in PyTorch for performing micro-batch pipeline parallelism with checkpointing proposed by GPipe (Huang et al., 2019). In particular, we develop a set of design components to enable pipeline-parallel gradient computation in PyTorch's define-by-run and eager execution environment. We show that each component is necessary to fully benefit from pipeline parallelism in such an environment, and demonstrate the efficiency of the library by applying it to various network architectures including AmoebaNet-D and U-Net. Our library is available at https://github.com/kakaobrain/torchgpipe.
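To make the idea of micro-batch pipeline parallelism concrete, the following is a minimal sketch of a GPipe-style forward schedule in plain Python. It is not torchgpipe's code: the function names `pipeline_schedule` and `bubble_fraction` are hypothetical, and the sketch assumes every (micro-batch, partition) task takes one equal time slot. Under that assumption, micro-batch i can run on partition j at tick i + j, so different partitions work on different micro-batches at the same time.

```python
def pipeline_schedule(num_microbatches, num_partitions):
    """Return, for each clock tick, the (micro_batch, partition) tasks
    that can run concurrently in an idealized forward pipeline.

    Assumption (illustrative only): every task takes exactly one tick.
    """
    schedule = []
    for tick in range(num_microbatches + num_partitions - 1):
        tasks = [
            (tick - stage, stage)  # micro-batch index, partition index
            for stage in range(num_partitions)
            if 0 <= tick - stage < num_microbatches
        ]
        schedule.append(tasks)
    return schedule


def bubble_fraction(num_microbatches, num_partitions):
    """Fraction of device time slots left idle by the schedule above."""
    total_slots = (num_microbatches + num_partitions - 1) * num_partitions
    busy_slots = num_microbatches * num_partitions
    return 1 - busy_slots / total_slots


if __name__ == "__main__":
    # 4 micro-batches flowing through 3 partitions (one per device).
    for tick, tasks in enumerate(pipeline_schedule(4, 3)):
        print(tick, tasks)
    print(bubble_fraction(4, 3))
```

In this idealized model, splitting a mini-batch into more micro-batches shrinks the idle fraction, which is the motivation for micro-batching. Checkpointing, which the abstract also mentions, is a separate memory-saving technique and is not modeled here.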

Code Implementations: 3 repos
