CL, AI, LG | Sep 8, 2025

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

arXiv:2509.06836v3

Originality: Highly original

AI Analysis

COMPACT addresses the need for efficient LLM deployment in edge and interactive applications, offering a novel hybrid pruning approach that keeps the standard transformer layout and adapts across model scales.

The paper aims to make large language models cheaper to deploy. It proposes COMPACT, a pruning method that jointly prunes rare vocabulary and FFN intermediate channels. The authors report state-of-the-art downstream performance with substantial reductions in parameters, GPU memory, and latency across models from 0.5B to 70B parameters.

Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Moreover, many pruning approaches that are effective on LLMs struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (it can trade off vocabulary pruning against FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
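
The two pruning steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact importance score, so the sketch assumes a simple one (mean squared activation per channel, with weight 1 for tokens that survive vocabulary pruning and weight 0 for pruned tokens). The function names `prune_vocab`, `channel_importance`, and `prune_ffn`, the two-matrix ReLU FFN, and the toy sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def prune_vocab(embedding, token_counts, keep):
    """Keep the `keep` most frequent tokens (assumed criterion for 'rare')."""
    order = np.argsort(-token_counts, kind="stable")
    kept_ids = np.sort(order[:keep])
    return embedding[kept_ids], kept_ids

def channel_importance(acts, token_ids, kept_ids):
    """Assumed score: mean squared activation per FFN channel,
    counting only positions whose token survives vocabulary pruning."""
    weights = np.isin(token_ids, kept_ids).astype(acts.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("no calibration token belongs to the kept vocabulary")
    return (weights[:, None] * acts ** 2).sum(axis=0) / total

def prune_ffn(w_up, w_down, importance, keep):
    """Drop the lowest-scoring intermediate channels from both FFN matrices."""
    kept = np.sort(np.argsort(-importance, kind="stable")[:keep])
    return w_up[:, kept], w_down[kept, :], kept

# Toy model and calibration data (all sizes are illustrative).
rng = np.random.default_rng(0)
vocab, d_model, d_ff, n_tokens = 10, 4, 8, 64
embedding = rng.normal(size=(vocab, d_model))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
token_ids = rng.integers(0, vocab, size=n_tokens)
token_counts = np.bincount(token_ids, minlength=vocab)

# Step (i): shrink the embedding table to the common tokens.
small_emb, kept_ids = prune_vocab(embedding, token_counts, keep=6)

# Step (ii): score FFN channels on common-token activations, then prune.
acts = np.maximum(embedding[token_ids] @ w_up, 0.0)  # ReLU FFN, for simplicity
scores = channel_importance(acts, token_ids, kept_ids)
small_up, small_down, kept_ch = prune_ffn(w_up, w_down, scores, keep=5)
```

Both steps only remove rows or columns from existing weight matrices, so the pruned layers keep the same dense shapes-and-matmul structure as the original, which is the deployment-friendliness the abstract refers to. In a real model the same row selection would presumably also be applied to the LM head, and the tokenizer's ids would need remapping to the reduced vocabulary; neither is shown here.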
