LG DC Oct 3, 2021

Scheduling Optimization Techniques for Neural Network Training

arXiv:2110.00929v1
Originality: Incremental advance
AI Analysis

This work addresses GPU inefficiency for researchers and practitioners training neural networks, offering incremental optimizations to existing scheduling methods.

The paper tackles GPU underutilization during neural network training by proposing out-of-order backprop, a scheduling technique that reorders gradient computations to improve GPU utilization. The results show substantial throughput improvements in single-GPU, data-parallel, and pipeline-parallel training compared to state-of-the-art systems, as evaluated with models such as MobileNet, BERT, and GPT-3 on up to 48 V100 GPUs.

Neural network training requires a large amount of computation, so GPUs are often used to accelerate it. Although GPUs improve performance, they are underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop allows their execution to be reordered to make the most of the GPU resources. We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can all be improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream out-of-order computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication. In pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks, including a lightweight computer vision model (MobileNet) and large NLP models (BERT and GPT-3), on up to forty-eight V100 GPUs. Our scheduling algorithms effectively improve the performance of single-GPU training as well as data- and pipeline-parallel training: compared to the respective state-of-the-art training systems, throughput is substantially improved in all three settings.
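
To make the reordering idea concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the standard backprop dependency structure for a plain chain of layers: the weight gradient of a layer depends only on the gradient arriving from the layer after it, so nothing downstream waits on it. The names `backward_ops`, `is_valid`, `conventional_order`, and `deferred_weight_grad_order` are hypothetical and chosen for illustration only.

```python
def backward_ops(num_layers):
    """Return {op: set of ops it depends on} for a chain of layers.

    Assumed model (illustrative, not taken from the paper's code).
    Layer i has two backward ops:
      ("dX", i): gradient w.r.t. the layer's input, passed on to layer i-1
      ("dW", i): gradient w.r.t. the layer's weights
    Both need the gradient arriving from layer i+1, i.e. ("dX", i+1).
    The last layer receives its gradient directly from the loss.
    """
    deps = {}
    for i in range(num_layers, 0, -1):
        upstream = {("dX", i + 1)} if i < num_layers else set()
        deps[("dX", i)] = set(upstream)
        deps[("dW", i)] = set(upstream)
    return deps


def is_valid(schedule, deps):
    """True if every op runs exactly once and only after its dependencies."""
    done = set()
    for op in schedule:
        if op in done or not deps[op] <= done:
            return False
        done.add(op)
    return done == set(deps)


def conventional_order(num_layers):
    """Layer-by-layer backprop: dW_i then dX_i, from the last layer to the first."""
    order = []
    for i in range(num_layers, 0, -1):
        order += [("dW", i), ("dX", i)]
    return order


def deferred_weight_grad_order(num_layers):
    """One possible reordering: run the dX chain first, then all weight gradients."""
    chain = [("dX", i) for i in range(num_layers, 0, -1)]
    weight_grads = [("dW", i) for i in range(num_layers, 0, -1)]
    return chain + weight_grads


if __name__ == "__main__":
    deps = backward_ops(3)
    print(is_valid(conventional_order(3), deps))          # True
    print(is_valid(deferred_weight_grad_order(3), deps))  # True
```

Both schedules satisfy the same dependencies, which is what leaves a scheduler free to move the weight-gradient work. Under this model it could, for example, be used to fill idle GPU time or be ordered so that communication overlaps with computation. How the three proposed algorithms actually place these operations is not specified in the abstract.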

