LG AI AR · Apr 19, 2025

Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna

arXiv:2504.14365v1 · 7 citations · h-index: 23 · Has Code · ISLPED

Originality: Highly original

AI Analysis

This work addresses hardware inefficiencies in deploying sparse LLMs, offering significant performance gains for AI inference applications, though it is incremental as it builds on existing sparsity and accelerator techniques.

The paper tackles the problem of limited expressivity in LLM pruning with fixed N:M sparsity by proposing a flexible layer-wise outlier-density-aware sparsity selection method (FLOW) and a digital compute-in-memory accelerator (FlexCiM), resulting in up to 36% accuracy improvement, 1.75x lower inference latency, and 1.5x lower energy consumption compared to existing methods.

Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
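To make the N:M constraint concrete, here is a minimal sketch of plain magnitude-based N:M pruning. It is not the FLOW selection method and is not taken from the linked repository. It assumes the common convention that N:M means at most N nonzero values in every group of M consecutive weights; the function name `nm_prune`, the example row, and the per-layer patterns are hypothetical and chosen only for illustration.

```python
def nm_prune(weights, n, m):
    """Keep the n largest-magnitude values in each group of m consecutive
    weights and zero the rest (assumed convention: N nonzeros per M)."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        keep = set(
            sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        )
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row, pruned under two different patterns.
row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]
print(nm_prune(row, 2, 4))  # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
print(nm_prune(row, 1, 4))  # [0.0, 0.0, 0.0, -1.2, 0.0, 0.0, -0.7, 0.0]

# Layer-wise flexibility: each layer may carry its own (N, M) pair.
# These assignments are made up; FLOW derives them from outlier statistics.
layer_patterns = {"layer0": (2, 4), "layer1": (1, 4), "layer2": (4, 8)}
for name, (n, m) in layer_patterns.items():
    print(name, nm_prune(row, n, m))
```

With a single fixed pattern, every layer is forced to the same density and group size. Letting `(n, m)` vary per layer, as in the last loop, is the kind of freedom the abstract describes, and it is also what requires hardware that can handle more than one pattern.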
