LG | AI | AR | Nov 25, 2024

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

arXiv:2411.16158v1 | 9 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses deployment challenges for large language models by improving hardware efficiency for quantized inference. The advance is incremental because it builds on existing quantization methods.

The paper tackles the inefficiency of mixed-precision matrix multiplication in LLM inference due to hardware limitations, introducing MixPE, a specialized processing element that achieves a 2.6x speedup and 1.4x energy reduction compared to state-of-the-art quantization accelerators.

Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that scale and zero point are shared within each quantization group, we propose performing dequantization after per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE utilizes efficient shift&add operations for multiplication, optimizing both computation and energy efficiency. Our experimental results demonstrate that MixPE surpasses the state-of-the-art quantization accelerators by $2.6\times$ speedup and $1.4\times$ energy reduction.
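
The two ideas in the abstract can be illustrated with a small software sketch. It is not the MixPE hardware design. It assumes a W4A8 setting (unsigned 4-bit weights, signed 8-bit activations) and asymmetric per-group quantization of the form w = scale * (q - zero); the abstract does not fix either choice. The function names are hypothetical. Because scale and zero point are constant within a group, sum(a_i * scale * (q_i - zero)) equals scale * (sum(a_i * q_i) - zero * sum(a_i)), so the integer dot product can be accumulated first and dequantized once per group.

```python
# Illustrative sketch only; names and the W4A8 / asymmetric scheme are assumptions.

def shift_add_mul(a: int, q: int, bits: int = 4) -> int:
    """Multiply integer a by an unsigned `bits`-bit integer q using shifts and adds."""
    acc = 0
    for k in range(bits):
        if (q >> k) & 1:      # bit k of the low-bit weight is set
            acc += a << k     # add a * 2**k
    return acc


def dot_dequant_first(acts, qweights, scales, zeros, group_size):
    """Baseline: dequantize every weight inside the loop, then multiply-accumulate."""
    total = 0.0
    for i, (a, q) in enumerate(zip(acts, qweights)):
        g = i // group_size
        total += a * (scales[g] * (q - zeros[g]))
    return total


def dot_dequant_after(acts, qweights, scales, zeros, group_size):
    """Deferred: integer accumulation per group, one scale/zero correction per group."""
    total = 0.0
    for g in range(len(scales)):
        lo, hi = g * group_size, (g + 1) * group_size
        group_acts = acts[lo:hi]
        # Integer-only inner loop built from shift-and-add multiplies.
        int_dot = sum(shift_add_mul(a, q) for a, q in zip(group_acts, qweights[lo:hi]))
        act_sum = sum(group_acts)
        # Single dequantization step for the whole group.
        total += scales[g] * (int_dot - zeros[g] * act_sum)
    return total


acts = [3, -2, 7, 1, -5, 4, 0, 6]        # int8 activations
qweights = [15, 0, 9, 4, 1, 12, 7, 3]    # 4-bit quantized weights
scales, zeros = [0.5, 0.25], [8, 7]      # one (scale, zero) pair per group of 4

baseline = dot_dequant_first(acts, qweights, scales, zeros, group_size=4)
deferred = dot_dequant_after(acts, qweights, scales, zeros, group_size=4)
print(baseline, deferred)
```

In this sketch the baseline performs one dequantization per weight, while the deferred version performs one per group, which is the source of the overhead reduction described in the abstract.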

