SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions
This work addresses the need for efficient model compression in AI applications on devices with limited resources, representing an incremental improvement over existing methods.
The paper tackles the problem of compressing large neural networks for deployment on resource-constrained devices by introducing SQS, a Bayesian variational learning framework that unifies pruning and low-bit quantization, achieving higher compression rates than prior methods while maintaining comparable performance.
Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods apply weight pruning or low-bit quantization individually, which often yields suboptimal compression rates when performance drops must be kept acceptable. We introduce SQS, a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning, which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we establish a consistency result for the proposed variational approach to learning a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a range of existing methods with comparable performance drops.
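To make the key idea concrete, the following is a minimal toy sketch of how a spike-and-slab inclusion probability and a GMM over quantization levels could together map a weight to a sparse, low-bit value. It is not the SQS training procedure, which learns these quantities by variational inference. The function names, the shared standard deviation `sigma`, the uniform mixing weights, the 0.5 pruning threshold, and the hard assignment to the most probable component are all assumptions made for illustration.

```python
import math


def gmm_responsibilities(w, means, sigma, mix):
    """Posterior probability of each Gaussian component given weight w.

    Assumes all components share one standard deviation `sigma`
    (an illustrative simplification).
    """
    logs = [math.log(p) - 0.5 * ((w - m) / sigma) ** 2
            for p, m in zip(mix, means)]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_quantize(weights, keep_prob, means, sigma=0.05, mix=None,
                    threshold=0.5):
    """Prune weights with low inclusion probability, quantize the rest.

    keep_prob[i] plays the role of the "slab" (inclusion) probability of
    weight i; values below `threshold` fall into the "spike" at zero.
    Surviving weights are snapped to the mean of their most probable
    GMM component.
    """
    if mix is None:
        mix = [1.0 / len(means)] * len(means)  # uniform mixing weights
    out = []
    for w, p in zip(weights, keep_prob):
        if p < threshold:
            out.append(0.0)  # spike: weight is pruned
            continue
        resp = gmm_responsibilities(w, means, sigma, mix)
        k = max(range(len(means)), key=resp.__getitem__)
        out.append(means[k])  # slab: weight takes a quantized level
    return out


# Hypothetical example values: 4 components correspond to 2-bit indices.
weights = [0.31, -0.02, -0.48, 0.07]
keep_prob = [0.9, 0.1, 0.8, 0.6]
means = [-0.5, -0.25, 0.25, 0.5]

quantized = sparse_quantize(weights, keep_prob, means)
print(quantized)  # [0.25, 0.0, -0.5, 0.25]
```

In this sketch each surviving weight needs only a component index (2 bits for four components) plus the shared table of component means, while pruned weights are stored as zeros, which is where the combined compression from sparsity and low-bit precision would come from.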