SE, AI, LG · Dec 3, 2021

Understanding Performance Problems in Deep Learning Systems

arXiv:2112.01771v2 · 35 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses performance problems in deep learning systems, which can waste resources and cause financial loss. It provides the first comprehensive analysis of such problems and a tool for detecting them, though the contribution is incremental in that it applies existing methods to a new domain.

The study characterized performance problems in deep learning systems by analyzing 224 problems collected from 210 StackOverflow posts, and built a benchmark of 58 problems to assess existing performance analysis approaches. Based on the findings, the authors developed a static checker that detected 488 new problems in 130 GitHub projects, of which 105 were confirmed and 27 fixed.

Deep learning (DL) has been widely applied to many domains. Unique challenges in engineering DL systems are posed by the programming paradigm shift from traditional systems to DL systems, and performance is one of the challenges. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications for developing high-performance DL systems, and for detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we develop a static checker Deep-Perf to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects, of which 105 have been confirmed and 27 have been fixed.
