CL
Sep 11, 2023

Understanding the Impact of Post-Training Quantization on Large Language Models

arXiv:2309.05210v36 citations
h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses the cost of deploying large language models by providing insight into the effects of quantization, but it is incremental, as it builds on existing quantization techniques.

The study examines how post-training quantization affects the performance of large language models under different generation hyperparameters. It finds that nf4 and fp4 are equally proficient 4-bit techniques with similar attributes. nf4 is more resilient to temperature variations for Llama2 models at lower temperatures, while fp4 is better suited to Falcon models. 4-bit quantized models are also more sensitive to temperature in the 0.5 to 0.8 range than their unquantized counterparts.

Large language models (LLMs) are rapidly increasing in size, with the number of parameters becoming a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even the recently released models that are publicly accessible for commercial use, such as Falcon and Llama2, have billions of parameters. This significant increase in parameter count makes deployment and operation very costly. Remarkable progress in quantization for large neural networks in general, and LLMs in particular, has made these models more accessible by enabling them to be deployed on consumer-grade GPUs. Quantized models generally perform comparably to their unquantized base counterparts. Nonetheless, a notable gap remains in our understanding of how these quantized models respond to hyperparameters such as temperature, max new tokens, and top-k, particularly for next-word prediction. The present analysis reveals that nf4 and fp4 are equally proficient 4-bit quantization techniques, with similar inference speed, memory consumption, and quality of generated content. The study identifies nf4 as more resilient to temperature variations for the Llama2 series of models at lower temperatures, while fp4 and fp4-dq prove to be a more suitable choice for the Falcon series of models. In general, 4-bit quantized models of varying sizes are more sensitive to temperature in the range of 0.5 to 0.8 than their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference, whereas unquantized bfloat16 models consistently yield the fastest inference across models of all sizes.
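For readers less familiar with the sampling hyperparameters named in the abstract, the following is a minimal sketch of how temperature and top-k are conventionally applied to next-token logits. It is not taken from the paper and does not model quantization itself. The function name `next_token_probs` and the example logits are hypothetical, chosen only for illustration.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Convert raw logits into sampling probabilities (illustrative sketch)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Rank token indices by logit and keep the top_k largest (all if top_k is None).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order if top_k is None else order[:top_k]
    # Divide the kept logits by the temperature, then apply a stable softmax.
    scaled = {i: logits[i] / temperature for i in kept}
    peak = max(scaled.values())
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())
    # Tokens outside the top_k set receive probability zero.
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

# Hypothetical scores for four candidate tokens (not data from the paper).
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.5, 0.8, 1.0):
    probs = next_token_probs(logits, temperature=t, top_k=3)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token. This suggests one plausible reason, offered here as an assumption rather than the paper's explanation, that small logit perturbations from quantization could matter more at some temperature settings than others.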

