LG NE

Jan 31, 2017

Mixed Low-precision Deep Learning Inference using Dynamic Fixed Point

Naveen Mellempudi, Abhisek Kundu, Dipankar Das, Dheevatsa Mudigere, Bharat Kaul

arXiv:1701.08978v2

6.934 citations

Originality: Incremental advance

AI Analysis

The paper addresses the computational cost of deploying deep learning models on resource-constrained devices. The advance is incremental because it builds on existing quantization techniques.

The paper tackles efficient deep learning inference by proposing a cluster-based quantization method that converts pre-trained full-precision weights to ternary weights and constrains activations to 8 bits, achieving 71.8% TOP-1 accuracy on ResNet-101 (within 6% of full precision) and replacing ~85% of multiplications with 8-bit accumulations.
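To make the 8-bit activation constraint concrete, here is a minimal sketch of dynamic fixed-point quantization, assuming one shared power-of-two exponent per tensor and signed 8-bit mantissas. The function names `to_dynamic_fixed_point` and `from_dynamic_fixed_point` are hypothetical, and the exponent selection rule is an illustrative assumption rather than the paper's exact procedure.

```python
import numpy as np

def to_dynamic_fixed_point(x):
    """Quantize a tensor to int8 mantissas plus one shared power-of-two exponent.

    Assumption: the exponent is the smallest one for which the largest
    magnitude in the tensor still fits into the signed 8-bit range.
    """
    qmax = 127  # largest magnitude representable in signed 8 bits (symmetric)
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return np.zeros(x.shape, dtype=np.int8), 0
    # Smallest integer exponent with max_abs / 2**exponent <= qmax.
    exponent = int(np.ceil(np.log2(max_abs / qmax)))
    mantissa = np.clip(np.round(x / 2.0 ** exponent), -qmax, qmax)
    return mantissa.astype(np.int8), exponent

def from_dynamic_fixed_point(mantissa, exponent):
    """Map int8 mantissas back to real values using the shared exponent."""
    return mantissa.astype(np.float64) * 2.0 ** exponent

activations = np.array([0.5, -1.0, 0.25, 3.0])
mantissa, exponent = to_dynamic_fixed_point(activations)
restored = from_dynamic_fixed_point(mantissa, exponent)
```

Because the exponent is shared across the tensor, the arithmetic between mantissas stays purely integer, and the scale is applied once per tensor as a shift.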

We propose a cluster-based quantization method to convert pre-trained full-precision weights into ternary weights with minimal impact on accuracy. In addition, we constrain the activations to 8 bits, thus enabling a sub-8-bit full-integer inference pipeline. Our method uses smaller clusters of N filters with a common scaling factor to minimize the quantization loss while also maximizing the number of ternary operations. We show that with a cluster size of N=4 on ResNet-101, we can achieve 71.8% TOP-1 accuracy, within 6% of the best full-precision results, while replacing ~85% of all multiplications with 8-bit accumulations. Using the same method with 4-bit weights achieves 76.3% TOP-1 accuracy, which is within 2% of the full-precision result. We also study the impact of cluster size on both performance and accuracy: a larger cluster size of N=64 can replace ~98% of the multiplications with ternary operations but introduces a significant drop in accuracy, which necessitates fine-tuning the parameters by retraining the network at lower precision. To address this, we have also trained a low-precision ResNet-50 with 8-bit activations and ternary weights by pre-initializing the network with full-precision weights, achieving 68.9% TOP-1 accuracy within 4 additional epochs. Our final quantized model can run on a full 8-bit compute pipeline, with a potential 16x improvement in performance compared to baseline full-precision models.
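The idea of grouping N filters under one scaling factor can be sketched as follows. This is an illustrative approximation, not the authors' algorithm: the threshold rule (a fixed fraction of the mean magnitude, a common heuristic for ternary weights) and the names `ternarize_clusters`, `dequantize`, and `threshold_ratio` are assumptions. Given the chosen support, the scale is the least-squares optimum, which is the mean magnitude of the weights that are kept.

```python
import numpy as np

def ternarize_clusters(weights, cluster_size=4, threshold_ratio=0.7):
    """Ternarize filters in clusters of `cluster_size`, one scale per cluster.

    weights: array of shape (num_filters, ...).
    Returns int8 codes in {-1, 0, +1} and one scaling factor per cluster.
    """
    num_filters = weights.shape[0]
    flat = weights.reshape(num_filters, -1).astype(np.float64)
    codes = np.zeros(flat.shape, dtype=np.int8)
    scales = []
    for start in range(0, num_filters, cluster_size):
        block = flat[start:start + cluster_size]
        magnitude = np.abs(block)
        # Assumed heuristic: zero out weights below a fraction of the mean magnitude.
        delta = threshold_ratio * magnitude.mean()
        keep = magnitude > delta
        # Least-squares optimal scale for this support: mean magnitude of kept weights.
        alpha = float(magnitude[keep].mean()) if keep.any() else 0.0
        codes[start:start + cluster_size] = (np.sign(block) * keep).astype(np.int8)
        scales.append(alpha)
    return codes.reshape(weights.shape), np.array(scales)

def dequantize(codes, scales, cluster_size=4):
    """Rebuild approximate weights by applying each cluster's scale to its filters."""
    per_filter = np.repeat(scales, cluster_size)[: codes.shape[0]]
    shape = (-1,) + (1,) * (codes.ndim - 1)
    return codes * per_filter.reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))   # 8 filters, so 2 clusters at N=4
codes, scales = ternarize_clusters(w, cluster_size=4)
approx = dequantize(codes, scales, cluster_size=4)
```

In this sketch, a smaller `cluster_size` gives each scale fewer weights to fit, which lowers the quantization error but requires more scaling multiplications; that is the accuracy versus ternary-operation trade-off the abstract describes for N=4 and N=64.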
