DC · LG · Nov 1, 2017

Efficient Inferencing of Compressed Deep Neural Networks

arXiv:1711.00244v1 · 6 citations
Originality: Incremental advance
AI Analysis

This work addresses deployment challenges in low-memory environments like mobile and IoT devices, but it is incremental as it builds on existing compression techniques.

The paper tackles the problem of efficient inference for compressed deep neural networks, particularly with Huffman encoding, by proposing parallel algorithms for single image and batch inference under memory constraints, achieving 15-25% throughput improvement for AlexNet.

The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments such as mobile phones and IoT edge devices, as well as in "inferencing as a service" environments in the cloud. Prior work has considered reducing model size through compression techniques such as pruning, quantization, and Huffman encoding. However, efficient inferencing using the compressed models has received little attention, especially with Huffman encoding in place. In this paper, we propose efficient parallel algorithms for inferencing of single images and batches under various memory constraints. Our experimental results show that our approach of using a variable batch size for inferencing achieves a 15-25% improvement in inference throughput for AlexNet while maintaining memory and latency constraints.
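To make the Huffman-encoding step concrete, here is a minimal sketch of how quantized weight indices for one layer might be Huffman-coded and then decoded before use. It is not the paper's algorithm: the function names, the bitstring representation, and the sample `layer_indices` are all assumptions chosen for illustration, and a real implementation would pack bits and decode in parallel.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Return a {symbol: bitstring} prefix code for a sequence of symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codewords of all symbols into one bitstring."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode a bitstring; greedy matching works because the code is prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical quantized weight indices for one layer (cluster ids 0..3).
layer_indices = [0, 0, 0, 0, 1, 1, 2, 0, 3, 0, 1, 0]
code = build_huffman_code(layer_indices)
bits = encode(layer_indices, code)
restored = decode(bits, code)
```

Because decoding is sequential within a bitstream, the weights of a layer have to be decoded before that layer can run, which is one reason inference on Huffman-encoded models interacts with memory and latency budgets.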

