On the quantization of recurrent neural networks
This work addresses the need for reduced memory use and faster computation in production ML systems. The contribution is incremental, since it applies known quantization techniques specifically to LSTMs.
The paper tackles the problem of efficiently deploying recurrent neural networks by introducing an integer-only quantization strategy for LSTM topologies. It reports accurate results with 8-bit integer weights and mostly 8-bit activations, and it targets a variety of hardware, including common CPU architectures and neural accelerators.
Integer quantization of neural networks can be defined as the approximation of the high-precision computation of the canonical neural network formulation using reduced integer precision. It plays a significant role in the efficient deployment and execution of machine learning (ML) systems, reducing memory consumption and leveraging typically faster computations. In this work, we present an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies, which themselves are the foundation of many production ML systems. Our quantization strategy is accurate (e.g., works well with post-training quantization), efficient and fast to execute (utilizing 8-bit integer weights and mostly 8-bit activations), and is able to target a variety of hardware (by leveraging instruction sets available in common CPU architectures, as well as available neural accelerators).
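To make the definition of integer quantization concrete, the following is a minimal sketch of symmetric per-tensor 8-bit quantization applied to a single dot product, the basic operation inside an LSTM gate. It is an illustration under stated assumptions, not the paper's scheme: the function names (`quantize_symmetric`, `int_dot`), the choice of a symmetric range of [-127, 127], and the example values are all hypothetical. The final rescale uses floating point only to compare against the exact result; an integer-only pipeline would typically replace it with a fixed-point multiplier.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one scale for the whole tensor.

    Assumption: symmetric quantization with zero point 0 and range
    [-qmax, qmax]; the paper may use a different scheme.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(q_w, q_x):
    """Integer-only multiply-accumulate.

    On real hardware the products would be summed in a wider
    (for example 32-bit) accumulator.
    """
    return sum(w * x for w, x in zip(q_w, q_x))


# Hypothetical example values, chosen only for illustration.
weights = [0.4, -1.0, 0.25]
inputs = [0.8, 0.5, -2.0]

q_w, s_w = quantize_symmetric(weights)
q_x, s_x = quantize_symmetric(inputs)

# Rescale the integer accumulator back to real units for comparison only.
approx = int_dot(q_w, q_x) * (s_w * s_x)
exact = sum(w * x for w, x in zip(weights, inputs))

print("quantized weights:", q_w)
print("quantized inputs: ", q_x)
print("exact:", exact, "approx:", approx)
```

The gap between `approx` and `exact` is the quantization error that an accurate strategy must keep small across every gate and time step of the recurrence.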