LG · May 7

SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

arXiv:2605.06402 · 58.1
AI Analysis

For practitioners deploying LLMs on hardware with native semi-structured sparsity support, SparseForge cuts the retraining-token budget for sparse recovery by 8x (5B vs. 40B tokens) while staying within 0.25 points of the state-of-the-art accuracy, addressing the main bottleneck of post-training semi-structured pruning.

SparseForge improves post-training semi-structured sparsity for LLMs by directly optimizing the sparsity mask via Hessian-guided annealing. On LLaMA-2-7B under 2:4 sparsity it reaches 57.27% average zero-shot accuracy with only 5B retraining tokens, surpassing the dense model (56.43%) and approaching the 57.52% of a state-of-the-art method that uses 40B tokens.

Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only 5B retraining tokens, surpassing the dense model's 56.43% accuracy and approaching the 57.52% result of a state-of-the-art method using 40B tokens. Such improvements on the accuracy-efficiency trade-off from SparseForge are shown to be consistent across model families.
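
To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch of a 2:4 mask (two nonzeros kept in every contiguous group of four weights) and of a soft mask that is annealed toward it. This is an illustration under stated assumptions, not the paper's algorithm: the importance score is an OBD-style diagonal-Hessian saliency (0.5 * h * w^2) used as a stand-in for the paper's Hessian-aware estimator, and the function names, the sigmoid relaxation, and the temperature schedule are all hypothetical.

```python
import numpy as np

def hard_2to4_mask(scores):
    """Binary mask keeping the 2 highest-scoring entries in each group of 4."""
    groups = scores.reshape(-1, 4)
    top2 = np.argsort(groups, axis=1)[:, 2:]      # indices of the 2 largest per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, top2, 1.0, axis=1)
    return mask.reshape(scores.shape)

def soft_2to4_mask(scores, temperature):
    """Sigmoid relaxation of the 2:4 mask (an assumed form, not from the paper).

    Each group is thresholded at the midpoint between its 2nd and 3rd largest
    scores, so the mask approaches the hard top-2 mask as temperature -> 0.
    """
    groups = scores.reshape(-1, 4)
    ordered = np.sort(groups, axis=1)             # ascending within each group
    threshold = 0.5 * (ordered[:, 1:2] + ordered[:, 2:3])
    z = (groups - threshold) / temperature
    soft = 0.5 * (1.0 + np.tanh(0.5 * z))         # numerically stable sigmoid
    return soft.reshape(scores.shape)

# Toy layer: random weights and a made-up positive diagonal Hessian estimate.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))                 # last dimension divisible by 4
hess_diag = rng.uniform(0.5, 2.0, size=weights.shape)
scores = 0.5 * hess_diag * weights**2             # OBD-style saliency (assumption)

# Anneal: the soft mask sharpens as the temperature drops.
for temperature in (1.0, 0.1, 0.001):
    soft = soft_2to4_mask(scores, temperature)
    print(temperature, np.round(soft[0], 3))

final_mask = hard_2to4_mask(scores)
sparse_weights = weights * final_mask             # exactly 2 nonzeros per group of 4
```

In an actual recovery loop the soft mask would multiply the weights during retraining while the temperature is lowered, so the network adapts gradually before the mask is frozen into the hardware-executable 2:4 pattern; how SparseForge schedules this and updates its importance scores is not specified in the abstract.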

