CV LG ML | Oct 18, 2025

Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch

arXiv:2510.16088v3

Originality: Highly original

AI Analysis

This work addresses the need for efficient, scalable quantization without retraining from scratch, offering a novel solution for deployment in resource-constrained environments.

The paper tackles neural network quantization by introducing a differentiable, bit-shifting method that scales to n bits. With weight-only quantization, it loses less than 1% accuracy relative to full precision on ImageNet with ResNet18 after 15 epochs, and with weight and activation quantization it achieves results competitive with SOTA.

Quantizing neural networks reduces the compute and memory required for inference. Previous work on quantization lacks two important aspects that this work provides. First, almost all previous work uses a non-differentiable quantization function, and the derivative is usually set manually during backpropagation, which makes the learning ability of the algorithm questionable. Our approach is not only differentiable; we also provide a proof that it converges to the optimal neural network. Second, previous work on shift/logarithmic quantization has either avoided quantizing activations along with weights or achieved lower accuracy. Learning logarithmically quantized values of the form $2^n$ requires a quantization function that scales beyond 1-bit quantization, which is another benefit of our method: it also provides $n$-bit quantization. Tested on image classification with the ImageNet dataset and ResNet18, our approach with weight-only shift-bit quantization comes within 1 percent of full-precision accuracy after only 15 training epochs. With both weight and activation shift-bit quantization, it achieves accuracy comparable to SOTA approaches in 15 training epochs, at a slightly higher inference cost than 1-bit quantization without logarithmic quantization (only more CPU instructions) and without requiring any higher-precision multiplication.
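To make the idea of shift (power-of-two) quantization concrete, here is a minimal sketch of the general technique: a weight is rounded to a signed power of two, so multiplying an integer activation by it reduces to a bit shift. This is not the paper's differentiable quantizer; the function names `quantize_pow2` and `shift_multiply`, the exponent range, and the log-domain rounding rule are all illustrative assumptions.

```python
import math


def quantize_pow2(w, n_bits=3, max_exp=0):
    """Round w to a signed power of two, rounding the exponent in the log domain.

    Exponents are clamped to [max_exp - 2**n_bits + 1, max_exp]. This range
    is an illustrative assumption, not taken from the paper.
    Returns (sign, exponent); sign is 0 when w is exactly zero.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    min_exp = max_exp - (2 ** n_bits) + 1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(x_int, sign, exp):
    """Multiply integer activation x_int by sign * 2**exp using shifts only.

    A negative exponent becomes a right shift, which floors the result.
    """
    if sign == 0:
        return 0
    shifted = x_int << exp if exp >= 0 else x_int >> -exp
    return sign * shifted


if __name__ == "__main__":
    sign, exp = quantize_pow2(0.3)        # 0.3 is approximated by 2**-2
    print(sign, exp)
    print(shift_multiply(64, sign, exp))  # 64 * 2**-2 via a right shift
```

Because the quantized weight is stored as a sign and a small integer exponent, no higher-precision multiplier is needed at inference time, which matches the cost argument made in the abstract.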

