Neural Network Quantization for Efficient Inference: A Survey
The paper addresses the problem of deploying powerful neural networks in real-world, resource-constrained environments but, as a survey, its contribution is incremental.
This paper surveys neural network quantization techniques developed over the last decade to reduce model size and complexity for efficient inference on resource-constrained devices, and proposes future research directions based on a comparison of those techniques.
As neural networks have become more powerful, there has been growing interest in deploying them in the real world; however, the power and accuracy of neural networks are largely due to their depth and complexity, which makes them difficult to deploy, especially on resource-constrained devices. Neural network quantization has emerged to meet this demand: it reduces the size and complexity of a network by lowering its numerical precision. With smaller and simpler networks, it becomes possible to run neural networks within the constraints of their target hardware. This paper surveys the many neural network quantization techniques that have been developed in the last decade. Based on this survey and a comparison of these techniques, we propose future directions of research in the area.
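
The abstract describes quantization only as reducing the precision of a network. As a rough illustration, and not a method taken from the survey itself, the sketch below shows one common form of the idea: uniform affine quantization of floating-point values to 8-bit unsigned integers using a scale and a zero point. The function names `quantize_uint8` and `dequantize`, the choice of 8 bits, and the sample weights are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_uint8(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to 8-bit unsigned integer codes with an affine scheme."""
    lo, hi = min(values), max(values)
    # Include 0.0 in the range so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    # Guard against an all-zero input, where the range would be empty.
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # Round to the nearest code and clip to the valid 8-bit range.
    quantized = [
        max(0, min(255, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [scale * (q - zero_point) for q in quantized]


# Hypothetical weights, chosen only to exercise the functions.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_uint8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each value is now stored in one byte instead of a 32-bit float, at the cost of a reconstruction error that, for this input, stays within one quantization step (`scale`). Real quantization schemes differ in how they choose the range, the bit width, and whether the mapping is symmetric, so this should be read as one simple instance rather than a representative of every technique the survey covers.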