LG · AI · DC · DS · Nov 17, 2025

MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity

arXiv:2511.13061v1 · 5 citations · h-index: 3
Originality: Highly original
AI Analysis

MACKO-SpMV makes unstructured pruning efficient for real-world LLM inference, addressing a memory and speed bottleneck in AI applications.

The paper addresses inefficient Sparse Matrix-Vector Multiplication (SpMV) at the low, unstructured sparsity levels (30-90%) typical of pruned Large Language Models (LLMs), where existing methods yield only limited memory reduction and speedup. It proposes MACKO-SpMV, a GPU-optimized format and kernel. At 50% sparsity, MACKO-SpMV achieves a 1.5x memory reduction and a 1.2-1.5x speedup over the dense representation. It also runs 2.8-13.0x faster than cuSPARSE and delivers 1.5x faster inference for Llama2-7B.
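To see why a 1.5x memory reduction at 50% sparsity is notable, the storage arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not the paper's accounting: it assumes fp16 values, a textbook CSR layout with 32-bit indices, and a hypothetical 4096 x 4096 weight matrix. The function names are illustrative only.

```python
def dense_bits(rows, cols, value_bits=16):
    """Bits needed to store every element of a dense matrix."""
    return rows * cols * value_bits


def csr_bits(rows, cols, density, value_bits=16, index_bits=32):
    """Bits for textbook CSR: one value and one column index per nonzero,
    plus a row-pointer array of length rows + 1.
    The 32-bit index width is an assumption for illustration."""
    nnz = round(rows * cols * density)
    return nnz * (value_bits + index_bits) + (rows + 1) * index_bits


rows = cols = 4096          # hypothetical layer size
density = 0.5               # 50% sparsity means half the entries are nonzero

dense = dense_bits(rows, cols)
csr = csr_bits(rows, cols, density)
csr_ratio = dense / csr     # below 1.0 means CSR is larger than dense

# Bit budget per nonzero implied by a 1.5x reduction over dense fp16.
target_reduction = 1.5
budget_per_nnz = 16 / (target_reduction * density)
overhead_per_nnz = budget_per_nnz - 16   # bits left for position metadata

print(f"dense / CSR size ratio: {csr_ratio:.3f}")
print(f"bits per nonzero for 1.5x reduction: {budget_per_nnz:.2f}")
print(f"of which metadata overhead: {overhead_per_nnz:.2f}")
```

Under these assumptions, CSR at 50% sparsity takes more space than the dense matrix, while a 1.5x reduction leaves only about 5 bits of position metadata per nonzero value. That gap is the storage overhead the summary refers to.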

Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to achieve a significant 1.5x memory reduction and 1.2-1.5x speedup over the dense representation. Speedups over other SpMV baselines are 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
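For readers unfamiliar with the operation, here is a minimal reference sketch of SpMV using the textbook CSR layout. It is not the MACKO format or kernel, which the abstract does not specify. It only shows what any SpMV routine computes, and the helper names `dense_to_csr` and `csr_spmv` are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a list-of-lists dense matrix to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)      # nonzero value
                col_idx.append(j)     # its column position
        row_ptr.append(len(values))   # where the next row starts
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = W @ x, touching only the stored nonzeros of W."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A small matrix with unstructured zeros (7 of 12 entries are zero).
W = [[2.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 4.0, 0.0, 5.0]]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = dense_to_csr(W)
y = csr_spmv(values, col_idx, row_ptr, x)
print(y)  # [6.0, 9.0, 28.0]
```

The inner loop reads `x` at positions given by `col_idx`, so every nonzero carries an index alongside its value. Formats tuned for low sparsity aim to shrink that per-nonzero metadata while keeping memory access friendly to the GPU.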
