CL · AI · Feb 21, 2025

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

arXiv:2502.15443v1 · 24 citations · h-index: 5 · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses memory constraints for deploying LLMs on resource-limited devices, presenting an incremental improvement over existing quantization methods.

The paper addresses the memory cost of deploying large language models on memory-limited devices by introducing a double compression framework that further compresses already-quantized models, achieving a compression ratio of about 2.2x and a 40% reduction in memory size with negligible loss in accuracy and inference speed.

Large language models (LLMs) exhibit excellent performance in various tasks. However, their memory requirements present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework to further compress LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance the compressibility of model weights by re-scaling the model parameters before quantization, followed by a pruning method that improves compressibility further. Building on this, we observe that decompression can be a bottleneck in practical scenarios. We then give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method, and propose a speed-adaptive method to overcome the bottleneck. The experimental results show that inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.
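
The general idea of "double compression" (quantize first, then apply a lossless compressor to the quantized weights) can be illustrated with a small sketch. This is not the paper's algorithm: the synthetic Gaussian weights, the symmetric int8 quantizer, the magnitude-based pruning rule, the use of `zlib`, and the helper names `quantize_int8`, `prune_smallest`, and `compression_ratio` are all assumptions chosen for illustration. It only shows that pruning small quantized values tends to make the byte stream more compressible.

```python
import random
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale


def prune_smallest(q, fraction):
    """Zero every entry whose magnitude is at or below the k-th smallest
    magnitude, where k = fraction * len(q). Ties may prune slightly more."""
    k = int(len(q) * fraction)
    if k == 0:
        return list(q)
    threshold = sorted(abs(v) for v in q)[k - 1]
    return [0 if abs(v) <= threshold else v for v in q]


def compression_ratio(q):
    """Size of the raw int8 byte stream divided by its zlib-compressed size."""
    raw = bytes(v & 0xFF for v in q)  # two's-complement byte per value
    return len(raw) / len(zlib.compress(raw, 9))


# Synthetic stand-in for one weight tensor (assumption: roughly Gaussian).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(65536)]

q, scale = quantize_int8(weights)
q_pruned = prune_smallest(q, 0.5)

ratio_base = compression_ratio(q)
ratio_pruned = compression_ratio(q_pruned)

print(f"lossless ratio after quantization only: {ratio_base:.2f}x")
print(f"lossless ratio after quantization + pruning: {ratio_pruned:.2f}x")
```

The ratios printed here depend on the synthetic data and on `zlib`, so they should not be compared with the roughly 2.2x figure reported in the abstract. The sketch also ignores decompression latency, which the abstract identifies as the practical bottleneck.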
