LGAI · Dec 19, 2025

CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

arXiv:2512.17970v1 · 1 citations · h-index: 16
Originality: Incremental advance
AI Analysis

The work addresses efficiency bottlenecks in deploying quantized LLMs for inference. It is an incremental advance, since it builds on existing codebook-based quantization methods.

The paper tackles the latency and cache pressure that dequantization causes in codebook-based quantization for LLM inference. It introduces CodeGEMM, a kernel that replaces dequantization with precomputed inner products between centroids and activations. At 2-bit quantization, it reports speedups of 1.83x on Llama-3 8B and 8.93x on Llama-3 70B over state-of-the-art codebook-based quantization, at comparable accuracy.

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.
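The difference between the two approaches described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' kernel: it assumes a single codebook shared across all weight groups, no scaling factors, and a matrix-vector product instead of a full GEMM. The function names (`dequant_matvec`, `build_psumbook`, `psumbook_matvec`) and the toy values are hypothetical, chosen only for illustration.

```python
def dequant_matvec(codes, codebook, x):
    """Baseline: rebuild each weight group from its centroid, then multiply."""
    v = len(codebook[0])  # length of each centroid vector
    out = []
    for row in codes:  # one row of code indices per output feature
        acc = 0.0
        for g, idx in enumerate(row):
            centroid = codebook[idx]  # centroid fetch repeated for every code
            for j in range(v):
                acc += centroid[j] * x[g * v + j]
        out.append(acc)
    return out


def build_psumbook(codebook, x):
    """Precompute the inner product of every centroid with every activation group."""
    v = len(codebook[0])
    n_groups = len(x) // v
    return [
        [sum(c[j] * x[g * v + j] for j in range(v)) for c in codebook]
        for g in range(n_groups)
    ]


def psumbook_matvec(codes, psumbook):
    """Psumbook path: code indices gather precomputed partial sums directly."""
    return [sum(psumbook[g][idx] for g, idx in enumerate(row)) for row in codes]


# Toy setup (assumed values): 4 centroids (2-bit indices), centroid length 2.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
x = [2.0, 3.0, 4.0, 5.0]   # activations, split into 2 groups of length 2
codes = [[0, 3], [2, 1]]   # 2 output features, one code index per group

baseline = dequant_matvec(codes, codebook, x)
psumbook = build_psumbook(codebook, x)
gathered = psumbook_matvec(codes, psumbook)
print(baseline, gathered)  # [3.0, 10.0] [3.0, 10.0]
```

In this sketch the partial-sum table is built once per activation vector and reused by every output row, so the per-row work reduces to one lookup and one add per group. The table has one entry per (group, centroid) pair, which is why its size depends on the codebook size rather than on the number of output features.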

