PF CL · Feb 27, 2020

Optimizing Memory-Access Patterns for Deep Learning Accelerators

Hongbin Zheng, Sejong Oh, Huiqing Wang, Preston Briggs, Jiading Gai, Animesh Jain, Yizhi Liu, Rich Heaton, Randy Huang, Yida Wang

arXiv:2002.12798v13.311 citations

Originality: Incremental advance

AI Analysis

The paper addresses performance bottlenecks in deep learning accelerators, which affect users of cloud-based inference services. It represents an incremental improvement in optimization techniques.

The paper tackles the problem of optimizing memory-access patterns so that deep learning accelerators can make full use of their compute power. It reports a substantial reduction in the impact of memory accesses for common neural-network models on Amazon's Inferentia chip.

Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can result in significant performance loss. This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses. Experiments show that our approach can substantially reduce the impact of memory accesses required by common neural-network models on a homegrown AWS machine-learning inference chip named Inferentia, which is available through Amazon EC2 Inf1 instances.
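To make the abstract's point about memory accesses concrete, here is a toy sketch. It is not the paper's polyhedral analysis and does not model Inferentia. It only counts off-chip element accesses for two chained elementwise operators, first staged one operator at a time and then staged together through a single scratchpad tile. The `CountingDRAM` class, the buffer names, the two operators, and the tile size are all hypothetical choices made for illustration.

```python
import numpy as np


class CountingDRAM:
    """Hypothetical off-chip memory that counts element reads and writes."""

    def __init__(self, n):
        self.buffers = {
            "x": np.arange(n, dtype=np.float64),  # model input
            "tmp": np.zeros(n),                   # intermediate tensor
            "y": np.zeros(n),                     # model output
        }
        self.reads = 0
        self.writes = 0

    def load(self, name, start, stop):
        # Copy a tile from off-chip memory into the (simulated) scratchpad.
        self.reads += stop - start
        return self.buffers[name][start:stop].copy()

    def store(self, name, start, tile):
        # Write a scratchpad tile back to off-chip memory.
        self.writes += tile.size
        self.buffers[name][start:start + tile.size] = tile


def run_separately(dram, n, tile):
    """Each operator is staged on its own; the intermediate goes off-chip."""
    for s in range(0, n, tile):
        e = min(s + tile, n)
        dram.store("tmp", s, dram.load("x", s, e) * 2.0)   # operator 1
    for s in range(0, n, tile):
        e = min(s + tile, n)
        dram.store("y", s, dram.load("tmp", s, e) + 1.0)   # operator 2


def run_together(dram, n, tile):
    """Both operators share one tile; the intermediate stays in scratchpad."""
    for s in range(0, n, tile):
        e = min(s + tile, n)
        scratch = dram.load("x", s, e)
        scratch = scratch * 2.0                            # operator 1
        dram.store("y", s, scratch + 1.0)                  # operator 2


def make_dram(n):
    return CountingDRAM(n)


if __name__ == "__main__":
    n, tile = 1000, 128
    separate, together = make_dram(n), make_dram(n)
    run_separately(separate, n, tile)
    run_together(together, n, tile)
    print("separate:", separate.reads + separate.writes)
    print("together:", together.reads + together.writes)
```

In this toy setting, staging the operators together avoids one write and one read of the intermediate tensor per element, so the off-chip traffic is halved while the output is unchanged. Real models involve operators with differing access patterns, which is where a whole-model analysis such as the one the abstract describes would be needed.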
