LG | INS-DET | May 1, 2024

Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip

arXiv:2405.00645v2 | 12 citations | h-index: 122 | FPGA
AI Analysis

This work addresses deployment efficiency for neural networks on hardware like FPGAs and ASICs, offering a novel method for mixed-precision quantization.

The paper tackles the challenge of model size and inference speed in deep learning by introducing High Granularity Quantization (HGQ), a gradient-based method for mixed-precision quantization that fine-tunes per-weight and per-activation precision, achieving up to 20x resource reduction and 5x latency improvement while preserving accuracy.

Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that some parts of the network can tolerate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that can fine-tune the per-weight and per-activation precision by making them optimizable through gradient descent. This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
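
To make the idea of per-weight precision concrete, here is a minimal NumPy sketch. It is not the HGQ implementation: the names `quantize` and `bitwidth_grad`, the example weights, the penalty `beta`, and the surrogate gradient are all assumptions chosen for illustration. It rounds each weight to a fixed-point grid with its own number of fractional bits, then computes a toy gradient of the loss `delta**2 + beta * frac_bits` with respect to the bitwidth, assuming the rounding error shrinks roughly like `2**-frac_bits`.

```python
import numpy as np

def quantize(w, frac_bits):
    """Round each weight to a fixed-point grid with its own fractional bits."""
    scale = np.exp2(frac_bits)          # grid spacing is 2**-frac_bits per weight
    return np.round(w * scale) / scale

def bitwidth_grad(w, frac_bits, beta):
    """Toy gradient of (delta**2 + beta * frac_bits) w.r.t. frac_bits.

    Assumption (not taken from the paper): the rounding error delta scales
    like 2**-frac_bits, so d(delta)/d(frac_bits) is about -ln(2) * delta.
    """
    delta = quantize(w, frac_bits) - w
    return -2.0 * np.log(2.0) * delta**2 + beta

# Hypothetical weights and bit allocations, chosen only for illustration.
w = np.array([0.7312, -0.1248, 0.5001, 0.0039])
uniform_bits = np.full(w.shape, 8.0)         # same precision everywhere
mixed_bits = np.array([6.0, 3.0, 1.0, 0.0])  # precision chosen per weight

q_uniform = quantize(w, uniform_bits)
q_mixed = quantize(w, mixed_bits)
err_mixed = np.abs(q_mixed - w)

# Total fractional bits as a crude proxy for hardware cost.
total_uniform = uniform_bits.sum()
total_mixed = mixed_bits.sum()

# Negative entries ask for more bits, positive entries for fewer.
grad = bitwidth_grad(w, mixed_bits, beta=1e-6)

print("quantized (mixed):", q_mixed)
print("total bits, uniform vs mixed:", total_uniform, total_mixed)
print("bitwidth gradient:", grad)
```

In this sketch the penalty `beta` plays the role of a resource cost: where the rounding error is large the error term dominates and the gradient pushes the bitwidth up, and where the error is already small the penalty dominates and pushes it down. A real training loop would also need bitwidths to stay integer-valued in the forward pass and would propagate gradients through the task loss rather than the raw rounding error.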

Code Implementations | 2 repos