Designing strong baselines for ternary neural network quantization through support and mass equalization
This work addresses the computational burden of deep neural networks for applications in computer vision by improving ternary quantization methods, though it appears incremental as it builds on existing quantization frameworks.
The paper tackles ternary neural network quantization by addressing a limitation of rounding to nearest, which does not account for the skewness and kurtosis of the weight distribution. It introduces the TQuant and MQuant operators to minimize quantization error, yielding significant performance improvements across data-free, post-training, and quantization-aware training scenarios.
Deep neural networks (DNNs) offer the highest performance in a wide range of computer vision applications. These results rely on over-parameterized backbones, which are expensive to run. This computational burden can be dramatically reduced by quantizing floating-point values to ternary values (2 bits, with each weight taking a value in {-1,0,1}), in a data-free (DFQ), post-training (PTQ), or quantization-aware training (QAT) scenario. In this context, we observe that rounding to nearest minimizes the expected error under a uniform distribution and thus does not account for the skewness and kurtosis of the weight distribution, which strongly affects ternary quantization performance. This raises the following question: should one minimize the highest or the average quantization error? To answer it, we design two operators, TQuant and MQuant, that correspond to these respective minimization tasks. We show experimentally that our approach significantly improves the performance of ternary quantization across a variety of DFQ, PTQ, and QAT scenarios and provides strong insights to pave the way for future research in deep neural network quantization.
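To make the observation about rounding to nearest concrete, the following is a minimal sketch, not the paper's TQuant or MQuant operators, whose exact definitions are not given here. It assumes a max-abs scaling before rounding, and contrasts it with a hypothetical equal-mass rule that places thresholds at the 1/3 and 2/3 quantiles. The function names `ternary_round_to_nearest` and `ternary_equal_mass` and the synthetic Gaussian weights are illustrative assumptions.

```python
import numpy as np


def ternary_round_to_nearest(w):
    """Scale by max |w| (an assumed scaling), then round to {-1, 0, 1}."""
    scale = np.max(np.abs(w))
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), scale


def ternary_equal_mass(w):
    """Hypothetical equal-mass rule: thresholds at the 1/3 and 2/3 quantiles,
    so each ternary value receives roughly a third of the weights."""
    lo, hi = np.quantile(w, [1 / 3, 2 / 3])
    q = np.where(w < lo, -1, np.where(w > hi, 1, 0))
    return q.astype(np.int8)


# Synthetic bell-shaped weights standing in for a real layer (assumption).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 10_000)

q_rtn, scale = ternary_round_to_nearest(w)
q_mass = ternary_equal_mass(w)

# Fraction of weights mapped to zero under each rule.
frac_zero_rtn = float(np.mean(q_rtn == 0))
frac_zero_mass = float(np.mean(q_mass == 0))
print(f"round-to-nearest zeros: {frac_zero_rtn:.2%}")
print(f"equal-mass zeros:       {frac_zero_mass:.2%}")
```

On bell-shaped weights, rounding to nearest with this scaling sends the large majority of weights to zero, because the rounding thresholds sit at half the largest magnitude. The quantile-based rule spreads the weights evenly over the three values by construction. Which behavior is preferable depends on whether the highest or the average quantization error matters more.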