Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
This work addresses the need for efficient compression of Large Language Models to reduce memory and computation costs. It offers a novel method that improves GPU compatibility and performance, though its contribution is an incremental improvement over existing low-rank pruning techniques.
The paper tackles the performance gap between low-rank pruning and semi-structured pruning in Large Language Models by proposing Pivoting Factorization (PIFA), a lossless meta low-rank representation. Relative to standard low-rank layers at rank = 50% of dimension, PIFA reduces memory by 24.2% and speeds up inference by 24.6%. The end-to-end MPIFA framework achieves performance comparable to semi-structured pruning while improving GPU efficiency.
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that learns, without supervision, a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses the non-pivot rows as linear combinations of the pivot rows, achieving 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free reconstruction method that minimizes error accumulation (M). MPIFA, which combines M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility. Our code is available at https://github.com/biomedical-cybernetics/pivoting-factorization.
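The pivot-row idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository above for that). The function names `pivot_factorize` and `pifa_matvec`, the greedy Gram-Schmidt pivot selection, and the tolerance are assumptions made for illustration; a pivoted QR or LU factorization would likely be a more robust way to choose pivots. For a rank-r matrix W with d rows and d columns, the sketch stores the r pivot rows W_p and a coefficient matrix C such that the non-pivot rows equal C @ W_p.

```python
import numpy as np

def pivot_factorize(W, tol=1e-8):
    """Split W into pivot rows W_p and coefficients C for the other rows."""
    m, _ = W.shape
    basis = []   # orthonormal vectors spanning the rows chosen so far
    pivots = []
    for i in range(m):
        residual = W[i].copy()
        for q in basis:
            residual -= (q @ residual) * q  # modified Gram-Schmidt step
        norm = np.linalg.norm(residual)
        # Keep the row only if it adds a new direction (assumed tolerance).
        if norm > tol * max(1.0, np.linalg.norm(W[i])):
            basis.append(residual / norm)
            pivots.append(i)
    pivots = np.array(pivots, dtype=int)
    non_pivots = np.setdiff1d(np.arange(m), pivots)
    W_p = W[pivots]
    # Solve C @ W_p = W[non_pivots] in the least-squares sense.
    C = np.linalg.lstsq(W_p.T, W[non_pivots].T, rcond=None)[0].T
    return pivots, non_pivots, W_p, C

def pifa_matvec(x, pivots, non_pivots, W_p, C):
    """Compute W @ x using only the pivot rows and the coefficients."""
    y = np.empty(len(pivots) + len(non_pivots))
    y_p = W_p @ x            # outputs for pivot rows
    y[pivots] = y_p
    y[non_pivots] = C @ y_p  # remaining outputs reuse the pivot outputs
    return y

rng = np.random.default_rng(0)
d, r = 8, 4  # rank = 50% of dimension
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

pivots, non_pivots, W_p, C = pivot_factorize(W)
x = rng.standard_normal(d)
y = pifa_matvec(x, pivots, non_pivots, W_p, C)

low_rank_params = r * (d + d)       # two factors of shapes (d, r) and (r, d)
pifa_params = W_p.size + C.size     # r*d + (d - r)*r
print(np.allclose(y, W @ x, atol=1e-6), low_rank_params, pifa_params)
```

In this toy setting the parameter count drops from r(d + d) to rd + (d - r)r, a saving of r² parameters, which is 25% when r = d/2. The 24.2% memory saving reported above is the authors' measured figure; this count is only a back-of-the-envelope check under the sketch's assumptions.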