LG | AI | AR | Apr 19, 2025

Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

arXiv:2504.14365v1 | 7 citations | h-index: 23 | Has Code | ISLPED
Originality: Highly original
AI Analysis

This work addresses hardware inefficiencies in deploying sparse LLMs, offering significant performance gains for AI inference applications, though it is incremental as it builds on existing sparsity and accelerator techniques.

The paper tackles the problem of limited expressivity in LLM pruning with fixed N:M sparsity by proposing a flexible layer-wise outlier-density-aware sparsity selection method (FLOW) and a digital compute-in-memory accelerator (FlexCiM), resulting in up to 36% accuracy improvement, 1.75x lower inference latency, and 1.5x lower energy consumption compared to existing methods.

Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
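To make the N:M pattern in the abstract concrete, here is a minimal sketch of plain magnitude-based N:M masking, assuming the common convention that N of every M consecutive weights are kept and the rest are zeroed. The function name `nm_prune` and the per-layer `(n, m)` choices in the usage lines are hypothetical illustrations. This is not the FLOW selection method, which chooses N and M per layer from outlier presence and distribution.

```python
from typing import List


def nm_prune(weights: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude values in each group of m consecutive weights.

    Assumption: "N:M" means n non-zeros per group of m. Ties are broken in
    favor of the lower index because Python's sort is stable.
    """
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")

    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Rank positions inside the group by descending magnitude.
        ranked = sorted(range(m), key=lambda i: -abs(group[i]))
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical usage: two layers pruned with different (n, m) patterns.
row = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]
layer_a = nm_prune(row, n=2, m=4)  # two survivors per group of four
layer_b = nm_prune(row, n=2, m=8)  # two survivors across all eight
```

With a fixed pattern every layer shares one `(n, m)` pair. The flexibility the abstract describes corresponds to letting that pair vary per layer, which is what a flexible accelerator then has to support in hardware.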

Code Implementations: 1 repo