DC · AR · LG · PF · Sep 9, 2020

Time-Based Roofline for Deep Learning Performance Analysis

arXiv:2009.04598v3 · 21 citations
AI Analysis

This paper provides a systematic performance analysis method for deep learning practitioners, though the contribution is incremental: it extends the Roofline model already established in high-performance computing.

The paper tackles the challenge of analyzing and optimizing compute-intensive deep learning applications by proposing a Roofline-based approach that incorporates compute/bandwidth complexity and run time, validated using 2D convolution and LSTM kernels to identify performance factors like arithmetic intensity and cache locality.

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based approach to performance analysis to facilitate the optimization of these applications. This approach is an extension of the Roofline model widely used in traditional high-performance computing applications, and it incorporates both compute/bandwidth complexity and run time in its formulae to provide insights into deep learning-specific characteristics. We take two sets of representative kernels, 2D convolution and long short-term memory, to validate and demonstrate the use of this new approach, and investigate how arithmetic intensity, cache locality, auto-tuning, kernel launch overhead, and Tensor Core usage can affect performance. Compared to the common ad-hoc approach, this study helps form a more systematic way to analyze code performance and identify optimization opportunities for deep learning applications.
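To make the idea of combining compute/bandwidth complexity with run time concrete, here is a minimal sketch of the classic Roofline bound and the time bound that follows from it. This is not the paper's exact formulation, which the abstract does not give; the function names and the hardware numbers (100 TFLOP/s peak, 1 TB/s bandwidth) are hypothetical and chosen only for illustration.

```python
def attainable_flops(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Classic Roofline: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def min_run_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Time view of the same bound: a kernel cannot finish faster than
    its compute time or its data-movement time, whichever is larger."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)


def limiting_resource(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel by comparing its arithmetic intensity to the ridge point."""
    arithmetic_intensity = flops / bytes_moved   # FLOPs per byte
    ridge_point = peak_flops / peak_bandwidth    # FLOPs per byte
    return "compute" if arithmetic_intensity >= ridge_point else "memory"


# Hypothetical machine and kernel, for illustration only.
PEAK_FLOPS = 100e12      # assumed peak throughput, FLOP/s
PEAK_BANDWIDTH = 1e12    # assumed peak memory bandwidth, bytes/s
KERNEL_FLOPS = 2e12      # assumed work done by the kernel
KERNEL_BYTES = 1e10      # assumed data moved by the kernel

intensity = KERNEL_FLOPS / KERNEL_BYTES
bound = attainable_flops(PEAK_FLOPS, PEAK_BANDWIDTH, intensity)
lower_time = min_run_time(KERNEL_FLOPS, KERNEL_BYTES, PEAK_FLOPS, PEAK_BANDWIDTH)
regime = limiting_resource(KERNEL_FLOPS, KERNEL_BYTES, PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"AI = {intensity:.0f} FLOP/byte, bound = {bound:.3e} FLOP/s, "
      f"time >= {lower_time:.3f} s, {regime}-bound")
```

Comparing a kernel's measured run time against `min_run_time` shows how far it sits from the hardware limit, and `limiting_resource` indicates whether raising arithmetic intensity or improving cache locality is the more promising direction.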

