LG · AI · OC · Jan 22, 2025

GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

arXiv:2501.12956v3 · 5 citations · h-index: 4 · ICML
Originality: Incremental advance
AI Analysis

This paper addresses deployment efficiency for large language models, offering an incremental improvement in quantization methods.

The paper tackles the challenge of efficiently deploying large language models by proposing GANQ, a GPU-adaptive non-uniform quantization framework. Compared with prior 3-bit and 4-bit methods, it narrows the perplexity gap to the FP16 baseline, and its quantized models achieve up to 2.57x speedup over the baseline on a single NVIDIA RTX 4090 GPU.

Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57$\times$ speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
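To make the abstract's contrast between uniform and non-uniform quantization concrete, here is a minimal sketch for a single weight row. It is not the GANQ algorithm: a plain 1-D Lloyd (k-means) iteration stands in for the paper's optimizer, and all function names (`uniform_table`, `lloyd_table`, `lut_dot`, and so on) are hypothetical. It only illustrates the idea that a low-bit code can index a small lookup table whose entries need not be evenly spaced.

```python
import random

def uniform_table(row, bits):
    """2**bits evenly spaced levels spanning the row's min..max."""
    levels = 2 ** bits
    lo, hi = min(row), max(row)
    step = (hi - lo) / (levels - 1)
    return [lo + step * k for k in range(levels)]

def assign(row, table):
    """Index of the nearest table entry for each weight (the low-bit codes)."""
    return [min(range(len(table)), key=lambda k: abs(w - table[k])) for w in row]

def lloyd_table(row, bits, iters=20):
    """Non-uniform table from 1-D Lloyd iterations, started at the uniform grid.

    Assumption: this is a stand-in optimizer for illustration only.
    """
    table = uniform_table(row, bits)
    for _ in range(iters):
        codes = assign(row, table)
        for k in range(len(table)):
            members = [w for w, c in zip(row, codes) if c == k]
            if members:  # entries with no assigned weights stay where they are
                table[k] = sum(members) / len(members)
    return table

def mse(row, codes, table):
    """Mean squared error between the weights and their table lookups."""
    return sum((w - table[c]) ** 2 for w, c in zip(row, codes)) / len(row)

def lut_dot(codes, table, x):
    """Dot product of one quantized weight row with activations x via lookup."""
    return sum(table[c] * xi for c, xi in zip(codes, x))

if __name__ == "__main__":
    rng = random.Random(0)
    row = [rng.gauss(0.0, 1.0) for _ in range(256)]  # synthetic weights
    t_uni = uniform_table(row, 3)
    t_non = lloyd_table(row, 3)
    print("uniform 3-bit MSE:    ", mse(row, assign(row, t_uni), t_uni))
    print("non-uniform 3-bit MSE:", mse(row, assign(row, t_non), t_non))
```

Because the Lloyd iterations start from the uniform grid and each step cannot increase the squared error, the non-uniform table is never worse than the uniform one on the row it was fitted to. Note that `lut_dot` reads table entries one at a time for clarity; a hardware-efficient lookup-table mpGEMM kernel of the kind the abstract describes would organize these lookups very differently.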

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
