AR NE May 9, 2018

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das

arXiv:1805.03718v122.5382 citations

Originality Highly original

AI Analysis

Neural Cache addresses the data-movement and latency bottleneck of DNN inference on hardware systems, offering a novel architectural solution with significant performance gains.

The paper accelerates deep neural network inference with the Neural Cache architecture, which repurposes cache structures as compute units so that layers execute in-cache with less data movement. For Inception v3, the reported gains include 18.3x lower inference latency than a multi-core CPU (Xeon E5) and a 50% reduction in power consumption relative to that CPU.

This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to do in-situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experimental results show that, for the Inception v3 model, the proposed architecture can improve inference latency by 18.3x over a state-of-the-art multi-core CPU (Xeon E5) and by 7.7x over a server-class GPU (Titan Xp). Neural Cache improves inference throughput by 12.4x over the CPU (2.2x over the GPU), while reducing power consumption by 50% relative to the CPU (53% relative to the GPU).
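The bit-serial idea named in the title can be illustrated in software. The sketch below is not the paper's circuit or data mapping; it only assumes a transposed layout in which each row holds one bit position of every operand, so that a single row-wide logic step processes all lanes at once. The function names (`to_bit_slices`, `from_bit_slices`, `bit_serial_add`) are hypothetical and chosen for illustration.

```python
def to_bit_slices(values, n_bits):
    """Transpose integers into bit-slices: slice i holds bit i of every value (LSB first)."""
    return [[(v >> i) & 1 for v in values] for i in range(n_bits)]


def from_bit_slices(slices):
    """Inverse of to_bit_slices: rebuild one integer per lane."""
    n_lanes = len(slices[0])
    return [
        sum(slices[i][lane] << i for i in range(len(slices)))
        for lane in range(n_lanes)
    ]


def bit_serial_add(a_slices, b_slices):
    """Add two sets of operands one bit position at a time.

    Each loop iteration models one row-wide step: every lane computes a
    full-adder sum and carry for the current bit position in parallel.
    The result has one extra slice for the final carry-out.
    """
    n_lanes = len(a_slices[0])
    carry = [0] * n_lanes  # one carry bit kept per lane
    out = []
    for a_row, b_row in zip(a_slices, b_slices):
        sum_row = [a ^ b ^ c for a, b, c in zip(a_row, b_row, carry)]
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_row, b_row, carry)]
        out.append(sum_row)
    out.append(carry)  # carry-out becomes the most significant slice
    return out


if __name__ == "__main__":
    a = [3, 5, 255, 0]
    b = [1, 7, 255, 0]
    total = bit_serial_add(to_bit_slices(a, 8), to_bit_slices(b, 8))
    print(from_bit_slices(total))
```

In this model the number of steps grows with the operand width (8 steps for 8-bit values) rather than with the number of operands, which is why a bit-serial scheme suits very wide arrays: latency per operation is higher, but every lane advances together.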
