AR · AI · Sep 15, 2017

A Streaming Accelerator for Deep Convolutional Neural Networks with Image and Feature Decomposition for Resource-limited System Applications

arXiv:1709.05116v1 · 6 citations
Originality: Incremental advance
AI Analysis

The paper addresses energy and latency bottlenecks in mobile, IoT, and UAV applications, and represents an incremental hardware optimization.

The paper tackles the problem of high computation latency and energy inefficiency in deep CNNs on resource-limited systems by proposing a hardware streaming accelerator with image and feature decomposition techniques, achieving a peak throughput of 144 GOPS and peak energy efficiency of 0.8 TOPS/W in a 65 nm CMOS implementation.

Deep convolutional neural networks (CNNs) are widely used in modern artificial intelligence (AI) and smart vision systems, but they are limited by computation latency, throughput, and energy efficiency in resource-limited scenarios such as mobile devices, the internet of things (IoT), and unmanned aerial vehicles (UAVs). A hardware streaming architecture is proposed to accelerate convolution and pooling computations for state-of-the-art deep CNNs. It is optimized for energy efficiency by maximizing local data reuse to reduce off-chip DRAM data access. In addition, image and feature decomposition techniques are introduced to optimize the memory access pattern for arbitrary image sizes and feature counts within limited on-chip SRAM capacity. A prototype accelerator was implemented in TSMC 65 nm CMOS technology with a 2.3 mm x 0.8 mm core area; it achieves 144 GOPS peak throughput and 0.8 TOPS/W peak energy efficiency.
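
The image decomposition idea mentioned in the abstract can be illustrated with a minimal software sketch. It assumes a single-channel image, a stride of 1, and no padding. The function names (conv2d_valid, conv2d_tiled), the tile sizes, and the list-of-lists representation are hypothetical choices for illustration, not the paper's hardware design, and feature decomposition is not modelled. The sketch only shows that an image too large for a small buffer can be convolved one tile at a time, provided each tile is loaded together with a halo of kernel-size-minus-one rows and columns.

```python
def conv2d_valid(image, kernel):
    """Direct 'valid' 2D convolution (cross-correlation, as used in CNNs)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def conv2d_tiled(image, kernel, tile_h, tile_w):
    """Same result as conv2d_valid, computed one output tile at a time.

    An output tile of tile_h x tile_w needs an input patch of
    (tile_h + kh - 1) x (tile_w + kw - 1): the tile plus a halo that
    overlaps the neighbouring tiles.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r0 in range(0, oh, tile_h):
        for c0 in range(0, ow, tile_w):
            th = min(tile_h, oh - r0)  # edge tiles may be smaller
            tw = min(tile_w, ow - c0)
            # Copy only the patch this tile needs (assumed stand-in for
            # filling a small on-chip buffer from off-chip memory).
            patch = [row[c0:c0 + tw + kw - 1]
                     for row in image[r0:r0 + th + kh - 1]]
            tile_out = conv2d_valid(patch, kernel)
            for i in range(th):
                out[r0 + i][c0:c0 + tw] = tile_out[i]
    return out


# Illustrative data: a 9 x 10 image and a 3 x 3 Sobel-like kernel.
image = [[(r * 7 + c * 3) % 11 for c in range(10)] for r in range(9)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

full = conv2d_valid(image, kernel)
tiled = conv2d_tiled(image, kernel, tile_h=4, tile_w=3)
assert full == tiled  # decomposition does not change the result
```

In this sketch a smaller tile lowers the buffer size needed per step but re-reads more halo pixels, which is the kind of trade-off between on-chip capacity and off-chip traffic that the abstract describes.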

