LG · ML · Apr 15, 2020

Training with Quantization Noise for Extreme Model Compression

arXiv:2004.07320v3 · 264 citations
AI Analysis

This work addresses the need for compact models in resource-constrained environments like mobile devices, offering significant improvements over existing compression methods.

The paper tackles the problem of extreme model compression by introducing a method that quantizes random subsets of weights during training to maintain unbiased gradients, achieving new state-of-the-art accuracy-size compromises, such as 82.5% accuracy on MNLI with a 14MB RoBERTa and 80.0% top-1 accuracy on ImageNet with a 3.3MB EfficientNet-B3.

We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator (STE). In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to quantize only a different random subset of weights during each forward pass, allowing unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result, we establish new state-of-the-art compromises between accuracy and model size in both natural language processing and image classification. For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3MB.
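The core idea in the abstract, quantizing only a random subset of weights on each forward pass, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `uniform_quantize` and `quant_noise_forward`, the noise rate `p`, and the step size `scale` are all hypothetical, the quantizer is a simple fixed-point rounding rather than Product Quantization, and noise is applied per weight instead of per block for simplicity. The backward pass is not shown; in a real training loop the masked weights would rely on the STE while the unmasked ones receive exact gradients.

```python
import random

def uniform_quantize(w, scale):
    """Round w to the nearest multiple of scale (a simple fixed-point quantizer)."""
    return round(w / scale) * scale

def quant_noise_forward(weights, p, scale, rng):
    """Return the weights used for one forward pass, plus the noise mask.

    Each weight is independently selected with probability p and replaced
    by its quantized value; unselected weights pass through unchanged,
    so no gradient approximation is needed for them.
    """
    mask = [rng.random() < p for _ in weights]
    noisy = [uniform_quantize(w, scale) if m else w
             for w, m in zip(weights, mask)]
    return noisy, mask

# A fresh random subset is drawn on every call, i.e. on every forward pass.
rng = random.Random(0)
weights = [0.13, -0.42, 0.77, 0.05, -0.91, 0.38]
noisy, mask = quant_noise_forward(weights, p=0.5, scale=0.25, rng=rng)
```

Under these assumptions, setting `p=1.0` recovers ordinary Quantization Aware Training (every weight is quantized on every pass), while `p=0.0` recovers unquantized training; intermediate values control the amount of noise.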

Code Implementations: 4 repos
