CV Oct 2, 2018

Post-training 4-bit quantization of convolution networks for rapid-deployment

Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry

arXiv:1810.05723v3 23.7 106 citations Has Code

Originality: Incremental advance

AI Analysis

The method enables rapid deployment of efficient models for applications such as edge computing, though the advance is incremental because it builds on existing quantization methods.

The paper tackles the problem of reducing memory and computational demands of convolutional neural networks through 4-bit quantization without fine-tuning or full datasets, achieving accuracy within a few percent of state-of-the-art baselines across various models.

Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, in addition to substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full dataset and time-consuming fine-tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset. We target the quantization of both activations and weights and suggest three complementary methods for minimizing quantization error at the tensor level, two of which obtain a closed-form analytical solution. Combining these methods, our approach achieves accuracy that is just a few percent less than the state-of-the-art baseline across a wide range of convolutional models. The source code to replicate all experiments is available on GitHub: https://github.com/submission2019/cnn-quantization.
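To make "minimizing quantization error at the tensor level" concrete, here is a minimal Python sketch of 4-bit symmetric uniform quantization with a clipping threshold. It is an illustration under stated assumptions, not the paper's method: the abstract does not spell out the three techniques, so a brute-force grid search over the threshold stands in for the closed-form solutions, and the names `quantize_dequantize`, `mse`, and `search_clip` are hypothetical. The weights are synthetic Gaussian samples, not values from a real model.

```python
import random


def quantize_dequantize(values, clip, num_bits=4):
    """Clip to [-clip, clip], round onto a signed integer grid, map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (num_bits - 1))    # -8 for 4 bits
    scale = clip / qmax              # step size between adjacent levels
    out = []
    for v in values:
        q = round(v / scale)         # nearest integer code
        q = max(qmin, min(qmax, q))  # saturate codes outside the range
        out.append(q * scale)
    return out


def mse(a, b):
    """Mean-squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip(values, num_bits=4, steps=100):
    """Grid-search the clipping threshold that minimises mean-squared error.

    Assumption: this search is a stand-in for an analytical solution.
    """
    max_abs = max(abs(v) for v in values)
    best_clip = max_abs
    best_err = mse(values, quantize_dequantize(values, max_abs, num_bits))
    for i in range(1, steps):
        clip = max_abs * i / steps
        err = mse(values, quantize_dequantize(values, clip, num_bits))
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip, best_err


random.seed(0)
# Synthetic stand-in for one weight tensor (assumption: roughly Gaussian).
weights = [random.gauss(0.0, 1.0) for _ in range(10000)]
max_abs = max(abs(w) for w in weights)
naive_err = mse(weights, quantize_dequantize(weights, max_abs))
clip, clipped_err = search_clip(weights)
print(f"no clipping: clip={max_abs:.3f} mse={naive_err:.5f}")
print(f"searched:    clip={clip:.3f} mse={clipped_err:.5f}")
```

The sketch shows the trade-off that any per-tensor threshold has to balance: a smaller clipping range distorts the few large values but gives finer resolution to the many small ones. No fine-tuning or dataset is involved, only the tensor itself.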
