Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
This work addresses the need for systematic insight into LLM quantization tradeoffs among researchers and practitioners deploying efficient AI systems. Its contribution is incremental, since it builds on existing quantization methods.
This paper examines the tradeoffs of quantizing large language models (LLMs) for efficient serving. It develops an automated characterization framework and evaluates 11 quantization methods across several model sizes and GPU architectures, finding task- and method-dependent tradeoffs that are sensitive to workload characteristics and to interactions with the hardware.
Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization (reducing precision to lower-bit formats) critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs under realistic serving conditions is still missing. In this work, we first develop a fully automated online characterization framework, qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across four model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterizations of LLM quantization from a joint performance, energy, and quality perspective.
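For readers unfamiliar with the term, the following is a minimal sketch of what "reducing precision to lower-bit formats" means in its simplest form: symmetric round-to-nearest quantization of floating-point weights to 8-bit integers. It is an illustrative assumption only; the abstract does not name the 11 methods studied, and this sketch is not claimed to be one of them. The function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric round-to-nearest quantization of floats to the int8 range.

    Illustrative only: not one of the methods evaluated in the paper.
    """
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to [-127, 127].
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The quality side of the tradeoff comes from this rounding error, while the performance and energy side comes from storing and moving fewer bits per weight. How those two effects balance under real serving workloads is the question the study characterizes.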