DC, LG · Sep 18, 2021

Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

Cheng Tan, Zhichao Li, Jian Zhang, Yu Cao, Sikai Qi, Zherui Liu, Yibo Zhu, Chuanxiong Guo

arXiv:2109.11067v1 · 8.0 · 50 citations · h-index: 44

Originality: Incremental advance

AI Analysis

This work addresses the cost-efficiency problem faced by organizations that serve DNN models. It is incremental because it builds on existing scheduling algorithms.

The paper tackles the challenge of efficiently partitioning NVIDIA A100 GPUs with Multi-Instance GPU (MIG) for serving Deep Neural Networks (DNNs). It proposes MIG-serving, which saves up to 40% of GPUs compared with using A100 as-is, while maintaining the same throughput.

Multi-Instance GPU (MIG) is a new feature introduced with NVIDIA A100 GPUs that partitions one physical GPU into multiple GPU instances. With MIG, A100 can be the most cost-efficient GPU ever for serving Deep Neural Networks (DNNs). However, discovering the most efficient GPU partitions is challenging. The underlying problem is NP-hard. It is also a new abstract problem, which we define as the Reconfigurable Machine Scheduling Problem (RMS). This paper studies serving DNNs with MIG, a new case of RMS. We further propose a solution, MIG-serving. MIG-serving is an algorithm pipeline that blends a variety of newly designed algorithms and customized classic algorithms, including a heuristic greedy algorithm, a Genetic Algorithm (GA), and Monte Carlo Tree Search (MCTS). We implement MIG-serving on Kubernetes. Our experiments show that, compared to using A100 as-is, MIG-serving can save up to 40% of GPUs while providing the same throughput.
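The following is a minimal sketch of a generic greedy heuristic that illustrates the partition-and-assign idea. It is not the paper's algorithm. It rests on three assumptions:

- Each GPU is split into instances measured in compute slices ("g").
- The hypothetical `PARTITIONS` list stands in for the real set of allowed A100 layouts. Actual MIG placement rules and memory slices are stricter; see NVIDIA's MIG documentation.
- Per-service throughput for each instance size (`throughput`, with made-up numbers) is known.

On each step, the sketch picks the partition and instance-to-service assignment that covers the largest fraction of the remaining demand. It then adds one GPU with that layout and repeats until every service's demand is met.

```python
from itertools import product

# Hypothetical subset of allowed A100 partitions, in compute slices ("g").
# Real MIG also constrains memory slices and instance placement.
PARTITIONS = [
    (7,),
    (4, 3),
    (3, 3),
    (4, 2, 1),
    (3, 2, 1),
    (2, 2, 2, 1),
    (1, 1, 1, 1, 1, 1, 1),
]


def best_assignment(partition, remaining, demand, throughput):
    """Try every way to give each instance in `partition` to a still-unserved
    service; return the largest demand fraction covered and that assignment."""
    active = [s for s, r in remaining.items() if r > 0]
    best_gain, best_assign = 0.0, None
    for assign in product(active, repeat=len(partition)):
        got = dict.fromkeys(active, 0.0)
        for size, svc in zip(partition, assign):
            got[svc] += throughput[svc][size]
        # Only throughput up to the remaining demand counts, normalized per service.
        gain = sum(min(got[s], remaining[s]) / demand[s] for s in active)
        if gain > best_gain:
            best_gain, best_assign = gain, assign
    return best_gain, best_assign


def greedy_plan(demand, throughput):
    """Add GPUs one at a time, each with the locally best partition/assignment."""
    remaining = {s: float(d) for s, d in demand.items()}
    plan = []
    while any(r > 0 for r in remaining.values()):
        best_gain, best_part, best_assign = 0.0, None, None
        for part in PARTITIONS:
            gain, assign = best_assignment(part, remaining, demand, throughput)
            if gain > best_gain:
                best_gain, best_part, best_assign = gain, part, assign
        if best_part is None:
            raise ValueError("no partition makes progress")
        for size, svc in zip(best_part, best_assign):
            remaining[svc] = max(0.0, remaining[svc] - throughput[svc][size])
        plan.append(list(zip(best_part, best_assign)))
    return plan


# Made-up demands (requests/s) and per-instance-size throughput, for illustration.
demand = {"resnet50": 1000, "bert": 300}
throughput = {
    "resnet50": {1: 150, 2: 290, 3: 420, 4: 540, 7: 900},
    "bert": {1: 40, 2: 85, 3: 130, 4: 170, 7: 300},
}

plan = greedy_plan(demand, throughput)
print(f"{len(plan)} GPUs")
for i, gpu in enumerate(plan):
    print(i, gpu)
```

Each step optimizes only the current GPU. As a result, a pass like this can end with a nearly idle GPU that covers a small leftover demand. That kind of slack is one plausible reason to run a search stage on top of a greedy start, such as the GA and MCTS the abstract mentions.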

