DC · LG · PF · Sep 11, 2020

Hierarchical Roofline Performance Analysis for Deep Learning Applications

arXiv:2009.05257v4 · 23 citations
Originality: Synthesis-oriented
AI Analysis

This work provides a practical tool for developers and researchers optimizing deep learning applications on GPUs, though its contribution is incremental because it extends existing tools (the Empirical Roofline Toolkit and Nsight Compute).

The paper tackles the challenge of analyzing deep learning performance on NVIDIA GPUs by introducing a methodology for hierarchical Roofline analysis. The methodology is validated on a climate image segmentation application implemented in both TensorFlow and PyTorch, which exposes framework-specific performance differences.
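For readers new to the method, the core of a hierarchical Roofline fits in a few lines. At each memory level, arithmetic intensity (AI) is the kernel's FLOPs divided by the bytes moved at that level, and the attainable performance is min(peak FLOP/s, AI × that level's bandwidth); the lowest of these ceilings bounds the kernel. The sketch below assumes you already have per-level bandwidths (for example from an Empirical Roofline Toolkit run) and per-level byte counts; the class and function names are hypothetical, and the numbers in the example are illustrative placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    name: str             # e.g. "L1", "L2", "HBM"
    bandwidth_gbs: float  # sustained bandwidth in GB/s (e.g. from an ERT run)
    bytes_moved: float    # bytes the kernel transferred at this level


def hierarchical_roofline(flops, runtime_s, peak_gflops, levels):
    """Place one kernel on a hierarchical Roofline (hypothetical helper).

    Returns achieved GFLOP/s, a dict mapping each level name to
    (arithmetic intensity in FLOP/byte, attainable GFLOP/s),
    the name of the level with the lowest ceiling, and the fraction
    of that ceiling the kernel actually achieved.
    """
    achieved = flops / runtime_s / 1e9
    report = {}
    for lvl in levels:
        ai = flops / lvl.bytes_moved
        # FLOP/byte * GB/s = GFLOP/s; the compute peak caps every level
        report[lvl.name] = (ai, min(peak_gflops, ai * lvl.bandwidth_gbs))
    # The tightest ceiling across all levels is what bounds the kernel
    bound = min(report, key=lambda name: report[name][1])
    return achieved, report, bound, achieved / report[bound][1]


if __name__ == "__main__":
    # Illustrative placeholder numbers, not measurements from the paper
    levels = [
        MemoryLevel("L1", 14000.0, 4e11),
        MemoryLevel("L2", 3000.0, 2e11),
        MemoryLevel("HBM", 800.0, 1e11),
    ]
    achieved, report, bound, frac = hierarchical_roofline(1e12, 0.5, 15000.0, levels)
    print(f"achieved {achieved:.0f} GFLOP/s, bound by {bound}, {frac:.0%} of ceiling")
    for name, (ai, ceiling) in report.items():
        print(f"{name}: AI={ai:.2f} FLOP/B, ceiling={ceiling:.0f} GFLOP/s")
```

In this made-up example the L1 and L2 ceilings both reach the compute peak, while the HBM ceiling (AI of 10 FLOP/B at 800 GB/s) is lower, so HBM is the binding level and the kernel runs at a quarter of it. Reading all levels together, rather than DRAM alone, is what makes the analysis "hierarchical".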

This paper presents a practical methodology for collecting the performance data needed to conduct hierarchical Roofline analysis on NVIDIA GPUs. It discusses extending the Empirical Roofline Toolkit to support a broader range of data precisions as well as Tensor Cores, and introduces an Nsight Compute-based method to accurately collect application performance information. This methodology enables automated machine characterization and application characterization for Roofline analysis across the entire memory hierarchy on NVIDIA GPUs, and it is validated on a complex deep learning application used for climate image segmentation. We use two versions of the code, in TensorFlow and PyTorch respectively, to demonstrate the use and effectiveness of this methodology. We highlight how the application utilizes the compute and memory capabilities of the GPU and how the implementation and performance differ between the two deep learning frameworks.
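The Nsight Compute step can be pictured as reducing each kernel's profile to a handful of counters: elapsed cycles and clock rate give runtime, predicated-on SASS instruction counts give FLOPs per precision (an FMA counts as two FLOPs), Tensor Core instruction counts are scaled by a per-instruction FLOP factor, and byte counters at L1, L2 and HBM give one arithmetic intensity per level. The sketch below assumes the metric values have already been parsed into a Python dict (for example from `ncu --csv` output; parsing is omitted). The metric names follow public Roofline scripts for NVIDIA GPUs and should be checked with `ncu --query-metrics` on your setup; the factor of 512 FLOPs per Tensor Core instruction is an assumption for V100, not a value stated on this page, and the helper names are hypothetical.

```python
# Metric names as used in public NVIDIA-GPU Roofline scripts (assumption);
# verify them with `ncu --query-metrics` for your GPU and ncu version.
PREFIX = "sm__sass_thread_inst_executed_op_"
TENSOR_INST = "sm__inst_executed_pipe_tensor.sum"
TENSOR_FLOPS_PER_INST = 512  # assumption for V100 Tensor Cores
LEVEL_BYTES = {
    "L1": "l1tex__t_bytes.sum",
    "L2": "lts__t_bytes.sum",
    "HBM": "dram__bytes.sum",
}


def precision_flops(metrics, p):
    """FLOPs for one precision: p is 'd' (FP64), 'f' (FP32) or 'h' (FP16)."""
    add = metrics.get(f"{PREFIX}{p}add_pred_on.sum", 0.0)
    mul = metrics.get(f"{PREFIX}{p}mul_pred_on.sum", 0.0)
    fma = metrics.get(f"{PREFIX}{p}fma_pred_on.sum", 0.0)
    return add + mul + 2.0 * fma  # an FMA counts as two FLOPs


def kernel_roofline_point(metrics):
    """Turn one kernel's metrics into runtime, FLOPs, GFLOP/s and AI per level."""
    runtime = (metrics["sm__cycles_elapsed.avg"]
               / metrics["sm__cycles_elapsed.avg.per_second"])
    flops = {p: precision_flops(metrics, p) for p in ("d", "f", "h")}
    flops["tensor"] = TENSOR_FLOPS_PER_INST * metrics.get(TENSOR_INST, 0.0)
    total = sum(flops.values())
    # Skip levels whose byte counter is missing or zero (avoids division by zero)
    ai = {lvl: total / metrics[name]
          for lvl, name in LEVEL_BYTES.items() if metrics.get(name)}
    return {"runtime_s": runtime, "flops": flops,
            "gflops": total / runtime / 1e9, "ai": ai}


if __name__ == "__main__":
    # Hypothetical counter values for a single mixed-precision kernel
    example = {
        "sm__cycles_elapsed.avg": 1.5e6,
        "sm__cycles_elapsed.avg.per_second": 1.5e9,
        f"{PREFIX}ffma_pred_on.sum": 1e8,
        f"{PREFIX}hfma_pred_on.sum": 1e7,
        TENSOR_INST: 1e6,
        "l1tex__t_bytes.sum": 4e8,
        "lts__t_bytes.sum": 2e8,
        "dram__bytes.sum": 1e8,
    }
    point = kernel_roofline_point(example)
    print(f"{point['gflops']:.1f} GFLOP/s, AI per level: {point['ai']}")
```

For a whole application made of many kernels, one option is to sum FLOPs, bytes and runtime over all profiled kernels before dividing, so each kernel contributes in proportion to the work and traffic it generates.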

Code Implementations: 1 repo
