LG | AI | Apr 8, 2025

Mosaic: Composite Projection Pruning for Resource-efficient LLMs

Bailey J. Eccles, Leon Wong, Blesson Varghese

arXiv:2504.06323v24.14 citations | h-index: 3 | Has Code | Future generations computer systems

Originality: Highly original

AI Analysis

This paper addresses the challenge of deploying LLMs on resource-constrained hardware. It offers a novel pruning method that improves efficiency and accuracy, though it is incremental in the context of existing compression techniques.

The paper tackles the high compute and memory requirements of deploying large language models (LLMs) by introducing Mosaic, a system built on composite projection pruning. Mosaic produces pruned models up to 7.19x faster than existing approaches. Its models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than those obtained from coarse-grained pruning, with up to 67% faster inference and 68% lower GPU memory use.

Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning relies on coarse-grained methods, which are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning: the synergistic combination of unstructured pruning, which retains accuracy, and structured pruning, which reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Mosaic models also show up to 67% faster inference and 68% lower GPU memory use. Mosaic is available for public use from https://github.com/blessonvar/Mosaic
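
To make the distinction in the abstract concrete, the following is a minimal sketch of how structured and unstructured pruning might be combined on a single weight matrix. It is not Mosaic's implementation: the function names, the magnitude and row-norm importance scores, the order of the two steps, and the sparsity ratios are all assumptions chosen for illustration, and the toy matrix merely stands in for one projection's weights.

```python
from typing import List

Matrix = List[List[float]]


def structured_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Remove whole rows with the smallest squared L2 norm.

    The matrix gets physically smaller, which is what reduces model size.
    (Row norm as the importance score is an assumption for this sketch.)
    """
    norms = sorted((sum(w * w for w in row), i) for i, row in enumerate(weights))
    n_drop = int(len(weights) * sparsity)
    dropped = {i for _, i in norms[:n_drop]}
    return [row[:] for i, row in enumerate(weights) if i not in dropped]


def unstructured_prune(weights: Matrix, sparsity: float) -> Matrix:
    """Zero the individual entries with the smallest magnitude.

    The shape is unchanged; only the least important weights are cleared.
    (Plain magnitude as the importance score is an assumption.)
    """
    cells = sorted(
        (abs(w), i, j) for i, row in enumerate(weights) for j, w in enumerate(row)
    )
    n_zero = int(len(cells) * sparsity)
    pruned = [row[:] for row in weights]
    for _, i, j in cells[:n_zero]:
        pruned[i][j] = 0.0
    return pruned


def composite_prune(
    weights: Matrix, structured_sparsity: float, unstructured_sparsity: float
) -> Matrix:
    """Hypothetical composite: shrink the matrix first, then sparsify it."""
    smaller = structured_prune(weights, structured_sparsity)
    return unstructured_prune(smaller, unstructured_sparsity)


# Toy 4x4 matrix standing in for one projection's weights.
W = [
    [0.9, -0.1, 0.4, 0.05],
    [0.01, 0.02, -0.03, 0.04],
    [-0.7, 0.6, 0.2, -0.5],
    [0.3, -0.8, 0.1, 0.15],
]

result = composite_prune(W, structured_sparsity=0.25, unstructured_sparsity=0.5)
for row in result:
    print(row)
```

In this toy run the structured step drops the low-norm second row, so the matrix shrinks from four rows to three, and the unstructured step then zeroes half of the remaining entries without changing the shape. A real system would choose the ratios per projection and use a more informed importance score than raw magnitude.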
