Mar 17, 2025

ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

arXiv:2503.13089v2 | 2 citations | h-index: 21 | ACL
Originality: Highly original
AI Analysis

ClusComp addresses the need for efficient deployment and broader accessibility of large models, offering a novel compression method with practical gains.

The paper tackles model compression for large language models by proposing ClusComp, a paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. It reports superior performance in 2-4 bit quantization and efficient finetuning that rivals full FP16 finetuning, and it supports compression and finetuning of 70B LLMs on a single GPU.

As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
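The idea of clustering a weight matrix into a codebook can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain k-means over consecutive groups of weights, and the function names (`cluster_weights`, `reconstruct`, `index_bits_per_weight`) and the default group size and codebook size are hypothetical choices made for illustration. Block-by-block finetuning of the codebooks is not shown.

```python
import math
import numpy as np


def cluster_weights(W, group_size=4, n_clusters=16, n_iters=10, seed=0):
    """Cluster groups of `group_size` weights with plain k-means.

    Assumption for illustration: W.size is divisible by group_size and
    there are at least n_clusters groups. Returns (codebook, codes).
    """
    rng = np.random.default_rng(seed)
    vectors = W.reshape(-1, group_size)

    # Initialise centroids from randomly chosen weight groups.
    init = rng.choice(len(vectors), size=n_clusters, replace=False)
    codebook = vectors[init].copy()

    for _ in range(n_iters):
        # Squared distance from every group to every centroid.
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        codes = dist.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(n_clusters):
            members = vectors[codes == c]
            if len(members) > 0:
                codebook[c] = members.mean(axis=0)

    # Final assignment against the updated centroids.
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dist.argmin(axis=1)
    return codebook, codes


def reconstruct(codebook, codes, shape):
    """Rebuild an approximate weight matrix by looking up each code."""
    return codebook[codes].reshape(shape)


def index_bits_per_weight(group_size, n_clusters):
    """Bits spent on indices per weight, ignoring codebook storage."""
    return math.log2(n_clusters) / group_size
```

In this sketch, only the integer codes and the small codebook need to be stored. With 16 centroids over groups of 4 weights, the indices cost log2(16) / 4 = 1 bit per weight, plus the overhead of the codebook itself.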

