LG · AI · AR · PF · Feb 18, 2025

SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

arXiv:2502.12444v1 · 3 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This work makes LLM inference more efficient and accessible on widely available CPUs. The advance is incremental but offers practical speed improvements.

The paper tackles the high compute cost and latency of large language models on CPUs by combining Intel Advanced Matrix Extensions (AMX) with unstructured sparsity to accelerate token generation. Applied to linear layers, this yields a 1.42x reduction in end-to-end latency over the current PyTorch implementation; applied to attention computation, it yields a 1.14x speedup without accuracy loss.

Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42 \times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation achieving a $1.14 \times$ speedup over the current systems without compromising accuracy. Code: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SparAMX
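To make the memory-bound argument in the abstract concrete, here is a minimal pure-Python sketch of why unstructured sparsity can help during decoding: fewer weight bytes have to be read per generated token. The abstract does not describe the storage format, so the bitmap-plus-packed-values layout below is an assumption for illustration only, as are the function names and the 2-byte weight size. It is not the SparAMX kernel and does not use AMX instructions.

```python
from typing import List, Tuple

# Hypothetical compressed row: (bitmap of nonzero positions, packed nonzero values).
CompressedRow = Tuple[int, List[float]]


def compress_row(row: List[float]) -> CompressedRow:
    """Pack one weight row into a position bitmap plus its nonzero values."""
    bitmap = 0
    values: List[float] = []
    for i, w in enumerate(row):
        if w != 0.0:
            bitmap |= 1 << i  # mark column i as holding a nonzero weight
            values.append(w)
    return bitmap, values


def sparse_matvec(rows: List[CompressedRow], x: List[float]) -> List[float]:
    """Multiply the compressed matrix by x, touching only stored nonzeros."""
    out: List[float] = []
    for bitmap, values in rows:
        acc = 0.0
        k = 0  # index into the packed nonzero values
        for i in range(len(x)):
            if (bitmap >> i) & 1:
                acc += values[k] * x[i]
                k += 1
        out.append(acc)
    return out


def dense_matvec(weights: List[List[float]], x: List[float]) -> List[float]:
    """Reference dense product used to check the sparse path."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def compressed_bytes(rows: List[CompressedRow], n_cols: int,
                     bytes_per_value: int = 2) -> int:
    """Bytes needed for bitmaps plus packed values (2-byte weights assumed)."""
    bitmap_bytes = (n_cols + 7) // 8
    return sum(bitmap_bytes + bytes_per_value * len(v) for _, v in rows)


W = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0]]
x = [1.0, 2.0, 3.0, 4.0]

packed = [compress_row(r) for r in W]
dense_size = len(W) * len(W[0]) * 2          # 2 bytes per dense weight
sparse_size = compressed_bytes(packed, len(W[0]))

print(sparse_matvec(packed, x), dense_size, sparse_size)
```

In this toy case the compressed weights take 8 bytes instead of 16 while producing the same result as the dense product. A real kernel would still need to expand the packed values into whatever tile layout the hardware expects before multiplying; this sketch does not model that step.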

Code Implementations: 1 repo
