LG · AI · Dec 10, 2024

Post-Training Statistical Calibration for Higher Activation Sparsity

arXiv:2412.07174v1 · 8 citations · h-index: 3 · Has Code · ENLSP
Originality: Incremental advance
AI Analysis

This work targets inference efficiency for large language models and other architectures. It appears incremental, since it builds on existing pruning frameworks.

The paper aims to increase activation sparsity in neural networks after training, reporting a 1.5x additional LLM decoding speedup over CATS at equal model quality.

We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5x additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at: https://github.com/IntelLabs/SCAP.
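
To make the two ideas in the abstract concrete, here is a minimal pure-Python sketch. It is not taken from the SCAP repository. The function names, the histogram-based mode estimate, the percentile-style threshold, and the step of folding the shift back into the layer bias are all assumptions made for illustration. The idea illustrated: if the inputs to a Fully-Connected layer cluster around a nonzero mode, subtracting that mode first lets a magnitude threshold zero out many more entries, and the subtracted shift can be compensated by a constant term.

```python
import random
from collections import Counter

def estimate_mode(samples, bin_width=0.05):
    """Hypothetical mode estimate: center of the most populated histogram bin."""
    bins = Counter(round(s / bin_width) for s in samples)
    return bins.most_common(1)[0][0] * bin_width

def threshold_for_sparsity(samples, mode, target_sparsity):
    """Magnitude threshold such that about target_sparsity of the
    mode-centered calibration samples fall at or below it."""
    mags = sorted(abs(s - mode) for s in samples)
    k = min(int(target_sparsity * len(mags)), len(mags) - 1)
    return mags[k]

def sparse_linear(x, weight, bias, mode, threshold):
    """Computes W @ prune(x - mode) + (W @ (mode * ones) + b).

    With no pruning this equals the dense W @ x + b, because the shift
    is added back as a constant (assumed here to be precomputable offline).
    """
    centered = [xi - mode for xi in x]
    pruned = [c if abs(c) > threshold else 0.0 for c in centered]
    out = []
    for row, b in zip(weight, bias):
        shift = mode * sum(row)  # constant compensation term
        acc = sum(w * p for w, p in zip(row, pruned) if p != 0.0)  # skip zeros
        out.append(acc + shift + b)
    return out, pruned

if __name__ == "__main__":
    random.seed(0)
    # Synthetic calibration activations that peak away from zero.
    calib = [random.gauss(0.3, 0.1) for _ in range(10000)]
    mode = estimate_mode(calib)
    thr = threshold_for_sparsity(calib, mode, target_sparsity=0.5)

    centered_zeros = sum(abs(s - mode) <= thr for s in calib) / len(calib)
    plain_zeros = sum(abs(s) <= thr for s in calib) / len(calib)
    print(f"mode={mode:.2f} threshold={thr:.3f}")
    print(f"sparsity with centering: {centered_zeros:.2%}")
    print(f"sparsity without centering: {plain_zeros:.2%}")
```

On this synthetic data the same threshold prunes roughly half of the centered activations but only a small fraction of the uncentered ones, which is the effect the abstract attributes to pre-calibrating activation distributions. Actual speedups depend on a kernel that skips the weight columns paired with zeroed inputs, which this sketch does not model.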

Code Implementations: 1 repo
