LG · Mar 5

SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou

arXiv:2603.05232v11.4 · h-index: 16 · Has Code

Originality: Highly original

AI Analysis

This work provides a practical method for accelerating LLMs with milder, accuracy-preserving sparsity patterns (e.g., 6:8, or 25% pruning) on existing NVIDIA hardware, where such patterns previously fell back to dense execution and gained nothing from sparsity.

This paper introduces SlideSparse, a system that enables hardware acceleration for $(2N-2):2N$ structured sparsity patterns on NVIDIA Sparse Tensor Cores, which natively support only 2:4 sparsity. By reconstructing each weight block into overlapping 2:4-compliant windows and fusing the matching activation rearrangement into per-token quantization, SlideSparse achieves a 1.33x speedup on compute-bound workloads for 6:8 sparsity in Qwen2.5-7B, approaching the theoretical upper bound of $4/3$.
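One way to see where the $4/3$ bound comes from is sketched below. It is a back-of-the-envelope derivation, not taken from the paper: it assumes each $2N$-weight block is expanded into $N-1$ windows of four weights (as the abstract states), that the sparse path runs at exactly 2x dense throughput, and that all other overheads are ignored. The symbols $K$, $K'$, $T_{\text{dense}}$, and $T_{\text{sparse}}$ are introduced here for illustration only.

$$
\begin{aligned}
K' &= \frac{4(N-1)}{2N}\,K = \frac{2(N-1)}{N}\,K
  && \text{(reduction length after expansion)} \\
\frac{T_{\text{sparse}}}{T_{\text{dense}}} &= \frac{K'/2}{K} = \frac{N-1}{N}
  && \text{(2:4 path at 2x throughput)} \\
\text{speedup} &= \frac{T_{\text{dense}}}{T_{\text{sparse}}} = \frac{N}{N-1},
  \qquad N = 4 \;\Rightarrow\; \frac{4}{3} \approx 1.33 .
\end{aligned}
$$

Under these assumptions the bound shrinks toward 1 as $N$ grows, so milder sparsity buys less headroom.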

NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning, a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX Spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup (1.33x) approaches the theoretical upper bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code is available at https://github.com/bcacdwk/vllmbench.
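The decomposition described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes the windows start at stride 2 (window `i` covers positions `2i` to `2i+3`) and uses a simple left-to-right greedy placement, and the function names `slide_decompose` and `lift_activations` are hypothetical. It only shows that a block with at most $2N-2$ nonzeros can be split into $N-1$ windows holding at most two nonzeros each, and that the dot product is preserved exactly when the activations are duplicated into the same overlapping windows.

```python
def slide_decompose(block):
    """Split one (2N-2):2N weight block into N-1 overlapping 4-wide windows.

    Assumption (not from the paper): window i covers positions 2i..2i+3,
    and nonzeros are placed greedily from left to right.
    Each returned window holds at most 2 nonzeros, i.e. it is 2:4 compliant.
    """
    size = len(block)
    if size < 4 or size % 2:
        raise ValueError("block length must be even and at least 4")
    if sum(1 for w in block if w != 0) > size - 2:
        raise ValueError("block has more than 2N-2 nonzeros")

    num_windows = size // 2 - 1
    windows = [[0] * 4 for _ in range(num_windows)]
    placed = [False] * size

    for i in range(num_windows):
        used = 0
        for pos in range(2 * i, 2 * i + 4):
            # Take the earliest unplaced nonzeros first, at most two per window.
            if block[pos] != 0 and not placed[pos] and used < 2:
                windows[i][pos - 2 * i] = block[pos]
                placed[pos] = True
                used += 1

    # Every nonzero must have landed in exactly one window.
    if not all(placed[p] or block[p] == 0 for p in range(size)):
        raise RuntimeError("greedy placement failed")
    return windows


def lift_activations(x_block):
    """Duplicate activations into the same overlapping windows (stride 2)."""
    num_windows = len(x_block) // 2 - 1
    return [x_block[2 * i:2 * i + 4] for i in range(num_windows)]


def windowed_dot(windows, lifted):
    """Dot product computed over the expanded (windowed) representation."""
    return sum(w * x for win, act in zip(windows, lifted)
               for w, x in zip(win, act))


# A 6:8 block (N = 4): six nonzeros out of eight weights.
weights = [1, 2, 3, 0, 4, 5, 0, 6]
activations = [1, 2, 3, 4, 5, 6, 7, 8]

windows = slide_decompose(weights)
lifted = lift_activations(activations)
dense = sum(w * x for w, x in zip(weights, activations))
```

Here `windowed_dot(windows, lifted)` equals `dense`, because each nonzero weight is placed in exactly one window and multiplied by the same activation as in the dense product. In the system described above, the activation duplication is fused into per-token quantization rather than done as a separate pass; this sketch does not model that fusion or the Sparse Tensor Core storage format.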
