DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference
This work addresses efficient inference for deep neural networks, offering an incremental improvement in quantization techniques for hardware acceleration.
The paper tackles the challenge of quantizing deep neural networks to low bitwidths without significant accuracy loss. It introduces DyBit, a dynamic bit-precision number representation that adapts to the distributions of weights and activations. At 4-bit quantization, DyBit reports 1.997% higher accuracy than the state-of-the-art, and the accompanying framework achieves up to 8.1x speedup over the original model.
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work proposes DyBit, an adaptive data representation with variable-length encoding. DyBit can dynamically adjust the precision and range of its separate bit-fields to adapt to the distributions of DNN weights and activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup. Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1x speedup compared with the original model.
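To make the low-bitwidth challenge concrete, the following minimal sketch shows plain symmetric uniform quantization, the fixed-grid baseline that adaptive formats aim to improve on. It is not DyBit's encoding, which the abstract does not specify. The function names and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_uniform(values, bits):
    """Symmetric uniform quantization to signed integer codes of `bits` width.

    Illustrative baseline only; this is NOT the DyBit format.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one step of the uniform grid
    # Round to the nearest grid point and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(values, bits):
    """Average absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_uniform(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical weights: many small values plus a single large outlier.
weights = [0.01, -0.02, 0.03, -0.05, 0.04, 0.02, -0.01, 1.0]

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.4f}")
print(f"4-bit mean abs error: {err4:.4f}")
```

With these sample weights, the single outlier stretches the 4-bit grid so far that every small weight rounds to zero, while the 8-bit grid still resolves them. A representation that can shift bits between range and precision to match the value distribution, as the abstract describes for DyBit, targets this kind of loss.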