LG · DC · Dec 4, 2020

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun

arXiv:2012.02732v1 · 12.487 citations · Has Code

Originality: Highly original

AI Analysis

Nimble reduces GPU scheduling overhead and runs independent GPU tasks in parallel, which speeds up deep learning inference and training for practitioners.

This paper addresses two inefficiencies in how deep learning frameworks schedule GPU tasks: large scheduling overhead and unnecessary serial execution. The authors propose Nimble, a DL execution engine that combines ahead-of-time (AoT) scheduling with automatic parallelization across multiple GPU streams, achieving speedups over PyTorch of up to 22.34$\times$ for inference and 3.61$\times$ for training.

Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead and unnecessary serial execution. To this end, we propose Nimble, a DL execution engine that runs GPU tasks in parallel with minimal scheduling overhead. Nimble introduces a novel technique called ahead-of-time (AoT) scheduling. Here, the scheduling procedure finishes before executing the GPU kernel, thereby removing most of the scheduling overhead during run time. Furthermore, Nimble automatically parallelizes the execution of GPU tasks by exploiting multiple GPU streams in a single GPU. Evaluation on a variety of neural networks shows that compared to PyTorch, Nimble speeds up inference and training by up to 22.34$\times$ and 3.61$\times$, respectively. Moreover, Nimble outperforms state-of-the-art inference systems, TensorRT and TVM, by up to 2.81$\times$ and 1.70$\times$, respectively.
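The two ideas in the abstract, scheduling once ahead of time and spreading independent tasks over several streams, can be illustrated with a toy sketch in plain Python. This is not Nimble's implementation: the abstract does not describe its algorithm, so the greedy stream assignment below is a stand-in chosen for brevity, and every name here (`assign_streams`, `build_schedule`, `replay`, the operator names) is hypothetical. No GPU is involved; a "stream" is just an integer label and a "launch" is a callback.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assign_streams(deps):
    """Greedy stream assignment (illustrative assumption, not the paper's method).

    deps maps each operator name to a list of operators it depends on.
    An operator inherits the stream of one predecessor that has not yet
    passed its stream on; otherwise it opens a new stream.
    """
    order = list(TopologicalSorter(deps).static_order())
    stream_of = {}
    handed_off = set()  # predecessors whose stream was already inherited
    next_stream = 0
    for op in order:
        for pred in sorted(deps.get(op, ())):
            if pred not in handed_off:
                stream_of[op] = stream_of[pred]
                handed_off.add(pred)
                break
        else:  # no predecessor stream available: start a new stream
            stream_of[op] = next_stream
            next_stream += 1
    return order, stream_of


def build_schedule(deps):
    """Ahead-of-time step: run all scheduling logic once, before execution.

    Returns a flat list of (op, stream, waits), where waits names the
    predecessors on other streams that need a cross-stream synchronization.
    """
    order, stream_of = assign_streams(deps)
    schedule = []
    for op in order:
        waits = tuple(sorted(p for p in deps.get(op, ())
                             if stream_of[p] != stream_of[op]))
        schedule.append((op, stream_of[op], waits))
    return schedule


def replay(schedule, launch):
    """Run-time step: replay the recorded schedule with no graph analysis."""
    for op, stream, waits in schedule:
        launch(op, stream, waits)


# A small graph with two independent branches (hypothetical operator names).
example_deps = {
    "conv1": [],
    "branch_a": ["conv1"],
    "branch_b": ["conv1"],
    "concat": ["branch_a", "branch_b"],
}

if __name__ == "__main__":
    schedule = build_schedule(example_deps)  # done once, ahead of time
    for _ in range(3):                       # reused on every iteration
        replay(schedule, lambda op, s, w: print(f"stream {s}: {op} waits={w}"))
```

In this sketch the cost of topological sorting and stream assignment is paid once in `build_schedule`, and each later iteration only walks a flat list. The two branches land on different streams, so a real backend could overlap them, with one recorded synchronization before `concat`.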
