LG · CL · Jun 20, 2023

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Georgia Tech, Microsoft
arXiv:2306.11222v2 · 129 citations · h-index: 62
Originality: Incremental advance
AI Analysis

The paper addresses the resource constraints of deploying large language models. It is an incremental advance, since it builds on existing compression techniques such as low-rank approximation and pruning.

The paper tackles the problem of large transformer models being too resource-intensive by proposing LoSparse, a compression technique that approximates weight matrices as a sum of low-rank and sparse matrices, and shows it significantly outperforms existing methods on tasks like natural language understanding and generation.

Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large, requiring massive memories and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximations and pruning, while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts in neurons, while pruning removes the incoherent and non-expressive parts in neurons. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks. We show that it significantly outperforms existing compression methods.
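To make the core idea concrete, the following is a minimal sketch of approximating a weight matrix as the sum of a low-rank matrix and a sparse matrix. It is not the procedure from the paper. As an assumption for illustration, it takes the low-rank part from a truncated SVD and builds the sparse part by keeping only the largest-magnitude entries of the residual. The function name `low_rank_plus_sparse` and the parameters `rank` and `keep_fraction` are hypothetical.

```python
import numpy as np


def low_rank_plus_sparse(W, rank, keep_fraction):
    """Approximate W as L + S, with rank(L) <= rank and S sparse.

    Illustrative assumption: L comes from a truncated SVD and S from
    magnitude pruning of the residual. This is a one-shot decomposition,
    not the compression procedure described in the paper.
    """
    # Low-rank part: keep the top `rank` singular triplets of W.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    # Sparse part: keep the k largest-magnitude entries of the residual.
    R = W - L
    k = int(keep_fraction * R.size)
    S = np.zeros(R.shape, dtype=R.dtype)
    if k > 0:
        idx = np.argpartition(np.abs(R).ravel(), -k)[-k:]
        S.flat[idx] = R.flat[idx]
    return L, S


# Toy example on a random matrix (hypothetical sizes and settings).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
L, S = low_rank_plus_sparse(W, rank=8, keep_fraction=0.05)

err_low_rank = np.linalg.norm(W - L)        # low-rank part alone
err_combined = np.linalg.norm(W - (L + S))  # low-rank plus sparse
print(err_low_rank, err_combined)
```

In this sketch the sparse term can only reduce the Frobenius error of the low-rank term, because it restores a subset of the residual entries exactly. Ignoring the cost of storing sparse indices, the pair needs roughly `rank * (m + n)` values plus the nonzeros of `S`, compared with `m * n` for the dense matrix.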

