LG, AI, CL · Jul 3, 2024

GPTQT: Quantize Large Language Models Twice to Push the Efficiency

arXiv:2407.02891v1 · 4 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the need for more efficient deployment of large language models, but its contribution is incremental because it builds on existing quantization methods.

This paper tackles the problem of reducing memory usage and increasing processing speed for large language models by introducing GPTQT, a post-training quantization method that expresses weights in 3 or 2 bits. Compared to a strong 3-bit quantization baseline, it reduces perplexity by 4.01 on OPT-66B and increases speed by 1.24 times on OPT-30B.

Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, which reduces memory usage and increases processing speed by expressing LLM weights in 3 or 2 bits. Practice has shown that directly minimizing the quantization error of the weights is ineffective and leads to overfitting. GPTQT therefore employs a progressive two-step approach: it first quantizes the weights to a relatively high bit width using linear quantization, then converts the resulting integer weights to a lower-bit binary coding. A re-explore strategy is proposed to optimize the initial scaling factor. During inference, the two steps are merged into a pure binary coding, enabling efficient computation. Tests across various models and datasets confirm GPTQT's effectiveness. Compared to the strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on OPT-66B and increases speed by 1.24 times on OPT-30B. The results on Llama2 show that GPTQT is currently the best binary coding quantization method for this kind of LLM.
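The two-step idea in the abstract (linear quantization to a higher bit width, then binary coding, then merging both into one binary-coded form) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain min/max linear quantizer and a simple greedy residual fit for the binary coding, and it leaves out the re-explore strategy and any GPTQ-style error compensation. All function names are hypothetical.

```python
def linear_quantize(weights, bits):
    """Step 1 (assumed form): uniform min/max quantization to unsigned ints."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    ints = [round((w - lo) / scale) for w in weights]
    return ints, scale, lo


def greedy_binary_coding(values, k):
    """Step 2 (assumed form): approximate values as
    offset + sum_i alpha_i * b_i with b_i in {-1, +1},
    fitting one (alpha, sign) pair per pass on the residual."""
    offset = sum(values) / len(values)
    residual = [v - offset for v in values]
    alphas, codes = [], []
    for _ in range(k):
        alpha = sum(abs(r) for r in residual) / len(residual)
        signs = [1 if r >= 0 else -1 for r in residual]
        alphas.append(alpha)
        codes.append(signs)
        residual = [r - alpha * s for r, s in zip(residual, signs)]
    return offset, alphas, codes


def merge(scale, zero, offset, alphas):
    """Fold the linear dequantization into the binary coding:
    scale * (offset + sum_i alpha_i * b_i) + zero
      = bias + sum_i (scale * alpha_i) * b_i."""
    return zero + scale * offset, [scale * a for a in alphas]


def dequantize(bias, alphas, codes):
    """Reconstruct weights from a pure binary-coded representation."""
    n = len(codes[0])
    return [bias + sum(a * c[j] for a, c in zip(alphas, codes)) for j in range(n)]


# Toy example: 4-bit linear quantization followed by 3-bit binary coding.
weights = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.7, 1.1]
ints, scale, zero = linear_quantize(weights, bits=4)
offset, alphas, codes = greedy_binary_coding(ints, k=3)
bias, merged_alphas = merge(scale, zero, offset, alphas)
approx = dequantize(bias, merged_alphas, codes)
```

After merging, only `bias`, the scaled `alphas`, and the sign codes are needed at inference time, so the intermediate integer representation never has to be materialized. In this sketch that is the sense in which the two steps collapse into pure binary coding.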

