LG · AI · CL · Mar 2, 2024

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

arXiv:2403.01273v1 · 21 citations · h-index: 16 · Has Code · NIPS
Originality: Incremental advance
AI Analysis

This paper addresses slow LLM inference on CPUs, targeting users who need efficient deployment without model finetuning. It is an incremental improvement focused on a specific hardware optimization.

The paper tackles efficient large language model inference on CPUs by proposing NoMAD-Attention, an algorithm that replaces multiply-add operations with lookups held in SIMD registers. It reports a speedup of up to 2× for a 4-bit quantized LLaMA-7B-based model at 16k context length while maintaining model quality.

Large language model inference on Central Processing Units (CPU) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2× at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.
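The idea of trading multiply-adds for lookups can be illustrated with a small sketch. It assumes a product-quantization-style scheme: each key is cut into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and a query-key dot product is then approximated by summing precomputed table entries. The abstract does not spell out these details, so the function names (`encode_key`, `build_tables`, `lookup_score`), the 16-centroid codebooks, and the dimensions are all illustrative assumptions. The code is plain Python, not the SIMD kernel from the linked repository.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split(vec, num_sub):
    """Split a vector into num_sub equal-length sub-vectors."""
    step = len(vec) // num_sub
    return [vec[i * step:(i + 1) * step] for i in range(num_sub)]

def encode_key(key, codebooks):
    """Replace each key sub-vector by the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(key, len(codebooks)), codebooks):
        dists = [sum((s - c) ** 2 for s, c in zip(sub, cen)) for cen in centroids]
        codes.append(dists.index(min(dists)))
    return codes

def build_tables(query, codebooks):
    """Per-query tables: tables[s][j] = <query sub-vector s, centroid j>.

    Multiplications happen here once per query, not once per key.
    """
    return [[dot(sub, cen) for cen in centroids]
            for sub, centroids in zip(split(query, len(codebooks)), codebooks)]

def lookup_score(codes, tables):
    """Approximate <query, key> using only table lookups and additions."""
    return sum(table[c] for table, c in zip(tables, codes))

# Hypothetical sizes: 8-dim vectors, 4 sub-spaces, 16 centroids per sub-space.
rng = random.Random(0)
codebooks = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(16)]
             for _ in range(4)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
query = [rng.uniform(-1, 1) for _ in range(8)]

key_codes = [encode_key(k, codebooks) for k in keys]   # done once per key
tables = build_tables(query, codebooks)                # done once per query
for k, codes in zip(keys, key_codes):
    print(f"exact={dot(query, k):+.3f}  lookup={lookup_score(codes, tables):+.3f}")
```

With random, untrained centroids the lookup scores are only rough approximations of the exact ones. A practical scheme would fit the centroids to real key vectors. The sketch uses 16 entries per table because a 4-bit code can index them and 16 one-byte entries fit a 128-bit SIMD register. Whether the paper's kernel uses exactly this layout is an assumption here.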

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
