DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis
This work addresses GPU design optimization for efficient CNN training acceleration, a problem relevant to researchers and engineers in deep learning and hardware design. Its contribution is incremental in that it builds on existing performance modeling approaches.
The paper tackles the problem of accurately modeling GPU performance for deep learning applications, specifically convolutional neural networks (CNNs). It introduces DeLTA, an analytical model that estimates the traffic at each level of the GPU memory hierarchy while accounting for complex reuse patterns. The authors report that the model is accurate and robust across different CNNs and GPU architectures.
Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layers. Optimizing GPU design for efficient CNN training requires accurate modeling of how performance improves as compute and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust for different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
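To make the idea of balancing compute and memory scaling concrete, the following is a minimal roofline-style sketch. It is not the DeLTA model: it uses only a lower bound on DRAM traffic (each tensor element moved once) and ignores the cache hierarchy and reuse patterns that DeLTA is designed to capture. The layer shape, the function names, and the GPU peak throughput and bandwidth figures are all hypothetical values chosen for illustration.

```python
def conv_flops(n, c, h, w, k, r, s):
    """FLOPs of a stride-1, same-padded convolution forward pass
    (2 FLOPs per multiply-accumulate)."""
    return 2 * n * k * h * w * c * r * s


def conv_min_dram_bytes(n, c, h, w, k, r, s, bytes_per_elem=4):
    """Lower bound on DRAM traffic: every input, filter, and output
    element crosses the DRAM interface exactly once (assumption)."""
    inputs = n * c * h * w
    filters = k * c * r * s
    outputs = n * k * h * w
    return (inputs + filters + outputs) * bytes_per_elem


def estimate_time(flops, dram_bytes, peak_flops, dram_bw):
    """Roofline-style estimate: the layer is limited by whichever of
    compute or DRAM transfer takes longer."""
    return max(flops / peak_flops, dram_bytes / dram_bw)


# Hypothetical 1x1 convolution layer: batch 32, 64 -> 64 channels, 56x56 maps.
shape = dict(n=32, c=64, h=56, w=56, k=64, r=1, s=1)
flops = conv_flops(**shape)
dram_bytes = conv_min_dram_bytes(**shape)

# Hypothetical GPU: 10 TFLOP/s peak compute, 500 GB/s DRAM bandwidth.
PEAK, BW = 10e12, 500e9

t_base = estimate_time(flops, dram_bytes, PEAK, BW)
t_2x_compute = estimate_time(flops, dram_bytes, 2 * PEAK, BW)
t_2x_bandwidth = estimate_time(flops, dram_bytes, PEAK, 2 * BW)
t_2x_both = estimate_time(flops, dram_bytes, 2 * PEAK, 2 * BW)

for label, t in [("baseline", t_base), ("2x compute", t_2x_compute),
                 ("2x bandwidth", t_2x_bandwidth), ("2x both", t_2x_both)]:
    print(f"{label:>13}: {t * 1e6:.1f} us")
```

Under these assumed numbers the layer is memory-bound, so doubling compute alone leaves the estimate unchanged, while doubling both resources gives the largest reduction. A model that tracks traffic at each memory hierarchy level would refine the single `dram_bytes` term used here.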