DC LG Nov 1, 2017

Efficient Inferencing of Compressed Deep Neural Networks

Dharma Teja Vooturi, Saurabh Goyal, Anamitra R. Choudhury, Yogish Sabharwal, Ashish Verma

arXiv:1711.00244v15.96 citations

Originality Incremental advance

AI Analysis

This work addresses deployment challenges in low-memory environments like mobile and IoT devices, but it is incremental as it builds on existing compression techniques.

The paper tackles the problem of efficient inference for compressed deep neural networks, particularly with Huffman encoding, by proposing parallel algorithms for single image and batch inference under memory constraints, achieving 15-25% throughput improvement for AlexNet.

The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments such as mobile phones and IoT edge devices, as well as "inferencing as a service" environments in the cloud. Prior work has considered reducing the size of the models through compression techniques such as pruning, quantization, and Huffman encoding. However, efficient inferencing using the compressed models has received little attention, especially with Huffman encoding in place. In this paper, we propose efficient parallel algorithms for inferencing of single images and batches under various memory constraints. Our experimental results show that our approach of using a variable batch size for inferencing achieves a 15-25% improvement in inference throughput for AlexNet, while maintaining memory and latency constraints.
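
To make the Huffman-encoding step mentioned in the abstract concrete, the following is a minimal sketch of Huffman coding applied to quantized weight indices. It is not the paper's inference algorithm; it only illustrates the general compression technique. The helper names (`build_huffman_code`, `encode`, `decode`) and the sample `weights` list are hypothetical and chosen for illustration.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Return a dict mapping each symbol to a prefix-free bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode sequentially; each code word must be read bit by bit."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical quantized weight indices, skewed toward 0 as after pruning.
weights = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 0, 0, 1, 0, 2, 0]
code = build_huffman_code(weights)
bits = encode(weights, code)
restored = decode(bits, code)
```

For this sample, the encoded stream takes 25 bits instead of the 32 bits a fixed 2-bit code would need. The `decode` loop also shows why inference on Huffman-encoded weights is awkward to parallelize: code words have variable length, so the position of a given weight in the bit stream is not known without decoding everything before it.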
