QFT: Post-training quantization via fast joint finetuning of all degrees of freedom
This work addresses industry demand for efficient neural network quantization, which reduces computational resource requirements while maintaining accuracy. It represents an incremental improvement over existing multi-step PTQ methods.
The paper tackles post-training quantization (PTQ) by proposing a method that jointly finetunes all quantization degrees of freedom in a single step, achieving 4-bit weight quantization results on par with state-of-the-art methods under PTQ constraints.
The post-training quantization (PTQ) challenge of bringing quantized neural network accuracy close to that of the original network has drawn much attention, driven by industry demand. Many methods emphasize optimization of a specific degree of freedom (DoF), such as the quantization step size, preconditioning factors, or bias fixing, often chained to others in multi-step solutions. Here we rethink quantized network parameterization in a hardware-aware fashion, towards a unified analysis of all quantization DoF, permitting, for the first time, their joint end-to-end finetuning. Our simple, extendable, single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with the state of the art (SoTA) within PTQ constraints on speed and resources.
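To make the notion of a quantization degree of freedom concrete, the following is a minimal sketch of signed 4-bit uniform weight quantization in which the step size is the only free parameter. It is an illustration under stated assumptions, not the QFT method: the function names, the toy weights, and the candidate step sizes are all hypothetical, and a simple grid search stands in for the gradient-based joint finetuning the abstract describes.

```python
def fake_quantize(weights, step, bits=4):
    """Quantize then dequantize: snap each weight to a signed grid of spacing `step`."""
    qmin = -(2 ** (bits - 1))      # -8 for 4 bits
    qmax = 2 ** (bits - 1) - 1     # +7 for 4 bits
    out = []
    for w in weights:
        q = round(w / step)            # nearest integer level
        q = max(qmin, min(qmax, q))    # clip to the representable range
        out.append(q * step)           # map back to the real-valued scale
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical toy weights and candidate step sizes (assumptions for illustration).
weights = [-0.9, -0.31, -0.05, 0.12, 0.4, 0.77]
candidate_steps = [0.05, 0.1, 0.15, 0.2, 0.3]

# Grid search over the step size: a stand-in for finetuning this single DoF.
errors = {s: mse(weights, fake_quantize(weights, s)) for s in candidate_steps}
best_step = min(errors, key=errors.get)
print("best step:", best_step, "mse:", errors[best_step])
```

A small step size gives fine resolution but clips large weights, while a large step size covers the range at the cost of rounding error. This trade-off is what makes the step size worth optimizing, and the abstract's point is that such parameters interact with the other DoF and are better tuned jointly than one at a time.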