PFCL Feb 27, 2020

Optimizing Memory-Access Patterns for Deep Learning Accelerators

arXiv:2002.12798v1, 11 citations
Originality: Incremental advance
AI Analysis

This paper addresses performance bottlenecks in deep learning accelerators, which matters to users of cloud-based inference services. It represents an incremental improvement in optimization techniques.

The paper tackles the problem of optimizing memory-access patterns so that the compute power of a deep learning accelerator can be fully used, and reports a substantial reduction in the impact of memory accesses for common neural-network models on Amazon's Inferentia chip.

Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can result in significant performance loss. This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses. Experiments show that our approach can substantially reduce the impact of memory accesses required by common neural-network models on a homegrown AWS machine-learning inference chip named Inferentia, which is available through Amazon EC2 Inf1 instances.
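To make the staging problem concrete, here is a minimal sketch that counts off-chip loads for a square matrix multiply with and without tiles staged in a scratchpad. It is a toy cost model under stated assumptions, not the paper's polyhedral analysis and not a model of Inferentia: the function names `naive_loads` and `tiled_loads`, the tile size `t`, and the assumption that a staged tile is fully reused before eviction are all hypothetical and chosen only for illustration.

```python
def naive_loads(n):
    """Count off-chip loads when every multiply-accumulate fetches its operands."""
    loads = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Assumption: A[i][k] and B[k][j] each come from off-chip memory.
                loads += 2
    return loads


def tiled_loads(n, t):
    """Count off-chip loads when t x t tiles are staged in a scratchpad first."""
    assert n % t == 0, "this sketch assumes the tile size divides n"
    loads = 0
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # Stage one t x t tile of A and one of B into the scratchpad.
                loads += 2 * t * t
                # Assumption: the t**3 multiply-accumulates for this tile pair
                # then read only the scratchpad, so they add no off-chip loads.
    return loads


if __name__ == "__main__":
    n, t = 8, 4
    print(naive_loads(n), tiled_loads(n, t))
```

In this model the untiled loop performs 2n^3 off-chip loads and the tiled loop performs 2n^3/t, so the saving grows with the tile size that fits in the scratchpad. A real accelerator adds constraints this sketch ignores, such as scratchpad capacity, output write-back, and reuse across operators.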

Code Implementations: 1 repo