LG · AI · May 12, 2024

Post Training Quantization of Large Language Models with Microscaling Formats

arXiv:2405.07135v3 · 12 citations · h-index: 12 · ENLSP
Originality: Synthesis-oriented
AI Analysis

This work addresses efficiency problems for LLM deployment, but is incremental as it combines and extends existing quantization methods.

This paper tackles the computational and storage challenges of Large Language Models by combining three existing post-training quantization techniques (SmoothQuant, AWQ, GPTQ) and extending them to microscaling formats, achieving 4-bit weight and 8-bit activation quantization with negligible accuracy loss compared to uncompressed baselines.

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these methods by enabling quantization to microscaling (MX) formats, extending the applicability of these PTQ algorithms beyond their original fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss compared to the uncompressed baseline.
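
To make the idea of an integer microscaling format concrete, here is a minimal sketch of block-wise "fake" quantization, in which each block of values shares one power-of-two scale and each element is stored as a small signed integer. It is an illustration under stated assumptions, not the paper's implementation: the function name `mxint_quantize`, the block size of 32, the symmetric integer range, and the rule used to choose the shared exponent are all assumptions made for this example, and the calibration steps of SmoothQuant, AWQ, and GPTQ are not shown.

```python
import math


def mxint_quantize(values, bits=8, block_size=32):
    """Quantize then dequantize a flat list of floats, block by block.

    Hypothetical helper for illustration only. Each block shares a single
    power-of-two scale; each element is rounded to a signed integer with
    `bits` bits (symmetric range, an assumption of this sketch).
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            # An all-zero block needs no scale.
            out.extend(0.0 for _ in block)
            continue
        # Smallest power-of-two scale that keeps amax / scale within qmax
        # (assumed rule; real MX implementations may pick the exponent differently).
        exponent = math.ceil(math.log2(amax / qmax))
        scale = 2.0 ** exponent
        for v in block:
            q = round(v / scale)
            q = max(-qmax, min(qmax, q))  # clamp to the integer range
            out.append(q * scale)         # dequantize for comparison
    return out


weights = [1.0, -0.5, 0.25, 0.0]
print(mxint_quantize(weights, bits=4))  # 4-bit, as for weights
print(mxint_quantize(weights, bits=8))  # 8-bit, as for activations
```

Calling the helper with `bits=4` for weights and `bits=8` for activations mirrors the 4-bit weight, 8-bit activation setting described above. Because the scale is shared per block, a single large value in a block coarsens the rounding step for every other element in that block, which is one reason outlier-handling methods are combined with this kind of format.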

