CL Dec 24, 2024

LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment

Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong, Yongtao Tang

arXiv:2412.18135v23.43 citations h-index: 9 IJCNN

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient LLM deployment on resource-limited edge devices, offering an incremental improvement over existing quantization methods.

The paper tackles the problem of deploying large language models on edge devices with varying computational resources by proposing LSAQ, a layer-specific adaptive quantization system that adjusts precision based on layer importance, resulting in improved perplexity and zero-shot task performance compared to baselines.

As Large Language Models (LLMs) demonstrate exceptional performance across various domains, deploying LLMs on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deploying LLMs on resource-limited edge devices. However, existing one-size-fits-all quantization methods often fail to dynamically adjust the memory requirements of LLMs, which limits their application to practical edge devices with varying computation resources. To tackle this issue, we propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. Specifically, LSAQ evaluates the importance of an LLM's neural layers by constructing top-k token sets from the inputs and outputs of each layer and calculating their Jaccard similarity. Based on layer importance, our system adaptively adjusts quantization strategies in real time according to the computation resources of the edge device, applying higher quantization precision to layers with higher importance and lower precision to layers with lower importance. Experimental results show that LSAQ consistently outperforms the selected quantization baselines in terms of perplexity and zero-shot tasks. Additionally, it can devise appropriate quantization schemes for different usage scenarios to facilitate the deployment of LLMs.
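
The layer-scoring and precision-assignment idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: each layer's input and output are taken to be already projected onto vocabulary-sized score vectors, importance is taken to be one minus the Jaccard similarity (a layer that changes the top-k token set more is treated as more important), and precision is assigned by a simple greedy rule that upgrades the most important layers from 4-bit to 8-bit while a memory budget allows. The function names and parameters (`top_k_tokens`, `layer_importance`, `assign_bits`, `budget_bytes`) are hypothetical.

```python
from typing import List, Sequence, Set


def top_k_tokens(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring vocabulary entries."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_importance(in_scores: Sequence[float],
                     out_scores: Sequence[float],
                     k: int) -> float:
    # Assumption: a layer whose output top-k set differs more from its
    # input top-k set is more important, so importance = 1 - Jaccard.
    return 1.0 - jaccard(top_k_tokens(in_scores, k), top_k_tokens(out_scores, k))


def assign_bits(importance: List[float],
                layer_params: List[int],
                budget_bytes: float,
                low_bits: int = 4,
                high_bits: int = 8) -> List[int]:
    """Greedy illustration: start every layer at low_bits, then upgrade
    layers in descending importance while the total stays within budget."""
    bits = [low_bits] * len(importance)
    used = sum(p * low_bits / 8 for p in layer_params)
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    for i in order:
        extra = layer_params[i] * (high_bits - low_bits) / 8
        if used + extra <= budget_bytes:
            bits[i] = high_bits
            used += extra
    return bits


if __name__ == "__main__":
    # Toy example: three layers of 100 parameters each, 200-byte budget.
    print(assign_bits([0.9, 0.1, 0.5], [100, 100, 100], budget_bytes=200))
```

Under these assumptions, a tighter memory budget simply leaves more layers at the lower bit width, which mirrors the abstract's description of adapting the quantization scheme to the resources of the target device.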
