LG CV Jan 30, 2023

Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

Deepika Bablani, Jeffrey L. Mckinstry, Steven K. Esser, Rathinakumar Appuswamy, Dharmendra S. Modha

IBM

arXiv:2301.13330v25.311 citations h-index: 42

Originality: Incremental advance

AI Analysis

This work addresses the need for faster and more energy-efficient inference in neural networks, offering incremental improvements in quantization methods for practical deployment.

The paper addresses mixed precision neural network quantization for efficient inference. It introduces two methods for selecting layer precisions, EAGL and ALPS, and recovers full-precision accuracy with a mix of 4-bit and 2-bit layers for models such as ResNet-50 and BERT-base. Compared with existing techniques, the methods perform better and need significantly less computational time.

For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single-epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101 and BERT-base transformer networks, demonstrating enhanced performance across the entire accuracy-throughput frontier. The techniques outperform existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computational time required to reach a solution.
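
The entropy-based idea behind EAGL can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform symmetric quantizer, the toy layer names (`conv1`, `conv2`), and the rule "lower entropy suggests a better candidate for lower precision" are all assumptions made here for illustration. The sketch quantizes each layer's weights to integer codes, computes the Shannon entropy of the code histogram, and ranks layers by that score.

```python
import math
import random
from collections import Counter


def quantize(weights, bits):
    """Uniform symmetric quantization to signed integer codes (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1      # e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # e.g. -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [min(qmax, max(qmin, round(w / scale))) for w in weights]


def entropy_bits(codes):
    """Shannon entropy, in bits, of the empirical distribution of integer codes."""
    counts = Counter(codes)
    n = len(codes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_layers_by_entropy(layers, bits=4):
    """Return (layer names sorted by ascending entropy, per-layer entropy scores)."""
    scores = {name: entropy_bits(quantize(w, bits)) for name, w in layers.items()}
    return sorted(scores, key=scores.get), scores


# Toy "layers": hypothetical weight lists, not taken from any real network.
rng = random.Random(0)
layers = {
    "conv1": [rng.gauss(0.0, 1.0) for _ in range(5000)],                      # spread-out weights
    "conv2": [rng.choice([-1.0, 0.0, 0.0, 0.0, 1.0]) for _ in range(5000)],  # near-ternary weights
}
order, scores = rank_layers_by_entropy(layers, bits=4)
for name in order:
    print(f"{name}: {scores[name]:.3f} bits")
```

Under these assumptions, a layer whose 4-bit codes occupy only a few levels scores low and would be considered first for 2-bit precision. No fine-tuning is involved in this kind of estimate, which is consistent with the abstract's description of EAGL as fast; ALPS, by contrast, would require a single epoch of fine-tuning per candidate layer.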
