LG | Jun 15, 2025

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models

Yan Sun, Qixin Zhang, Zhiyuan Yu, Xikun Zhang, Li Shen, Dacheng Tao

arXiv:2506.12876v19.41 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency bottlenecks in LLM deployment through a novel sparsity learning approach, offering incremental improvements over prior rule-based and gradient-driven methods.

The paper tackles the cost of training large language models under (N:M)-sparsity for hardware-friendly acceleration. It proposes MaskPro, a linear-space probabilistic framework that learns categorical distributions from which sparsity patterns are sampled, and reports lower training cost and better performance than existing methods.

The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory usage. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which learns a prior categorical distribution for every $M$ consecutive weights and then uses this distribution to generate the (N:M)-sparsity pattern through $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the extremely large combinatorial space, we propose a novel update method that uses a moving average tracker of loss residuals instead of the vanilla loss. Finally, we conduct a comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
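
The two mechanisms named in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the function and class names (`sample_nm_mask`, `ResidualTracker`, `train_group`), the toy loss, the decay and learning-rate values, and the plain score-function (REINFORCE-style) gradient are all assumptions made for illustration. The sketch keeps $M$ logits per group of $M$ weights, draws $N$ distinct indices one at a time from the renormalised categorical distribution, and scales the log-probability gradient by the loss minus a moving average of past losses rather than by the raw loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_nm_mask(logits, n, rng):
    """Draw n distinct indices from softmax(logits), one at a time.

    Returns (mask, grad): mask keeps exactly n of the M entries, and grad is
    the gradient of the log-probability of the ordered draw w.r.t. logits.
    """
    m = len(logits)
    remaining = list(range(m))
    mask = [0] * m
    grad = [0.0] * m
    for _ in range(n):
        # Renormalise over the indices that have not been picked yet.
        probs = softmax([logits[i] for i in remaining])
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        # Gradient of log-softmax over the remaining set: one-hot minus probs.
        for idx, p in zip(remaining, probs):
            grad[idx] += (1.0 if idx == pick else 0.0) - p
        mask[pick] = 1
        remaining.remove(pick)
    return mask, grad


class ResidualTracker:
    """Exponential moving average of losses, used as a baseline (assumed form)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None

    def residual(self, loss):
        """Return loss minus the running average, then update the average."""
        if self.value is None:
            self.value = loss
        res = loss - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * loss
        return res


def train_group(weights, n, steps=200, lr=0.1, seed=0):
    """Toy loop for one group of M weights; returns the learned logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(weights)
    tracker = ResidualTracker()
    for _ in range(steps):
        mask, grad = sample_nm_mask(logits, n, rng)
        # Hypothetical toy loss: squared magnitude of the pruned weights.
        loss = sum(w * w for w, keep in zip(weights, mask) if not keep)
        signal = tracker.residual(loss)
        # Score-function update scaled by the residual, not the raw loss.
        logits = [l - lr * signal * g for l, g in zip(logits, grad)]
    return logits
```

For a 2:4 pattern, `sample_nm_mask([0.0] * 4, 2, random.Random(0))` returns a mask with exactly two ones, and `train_group([0.1, -2.0, 0.05, 1.5], 2)` returns four logits. Storage grows linearly with the number of weights because each group holds $M$ logits rather than one parameter per possible $N$-of-$M$ pattern.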

