WaterSIC: information-theoretically (near) optimal linear layer quantization
This work presents a more efficient weight quantization method for large language models, which matters to researchers and practitioners deploying LLMs with reduced computational and memory footprints.
This paper addresses the problem of quantizing dense linear layers to low precision and shows that the popular GPTQ algorithm can be arbitrarily suboptimal relative to the information-theoretic limit. It proposes a new algorithm, WaterSIC, which comes within a rate gap of 0.255 bits of the information-theoretic limit and establishes new state-of-the-art performance at quantization rates of 1 to 4 bits on Llama and Qwen LLMs.
This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed from an information-theoretic (IT) perspective. It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed "WaterSIC", is proposed and is shown to be within a rate gap of 0.255 bits of the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as "waterfilling". Applying WaterSIC to the Llama and Qwen families of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.
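To make the waterfilling idea concrete, the following is a minimal sketch of the classical reverse waterfilling rate allocation for independent Gaussian components under a total squared-error budget. It is not the WaterSIC algorithm itself, whose details are not given here. The function name `reverse_waterfill`, its parameters, and the example variances are illustrative assumptions.

```python
import numpy as np

def reverse_waterfill(variances, total_distortion, iters=100):
    """Classical reverse waterfilling (textbook result, NOT WaterSIC).

    Hypothetical helper: given positive per-component variances and a total
    distortion budget, find the water level theta such that
    sum_i min(theta, var_i) == total_distortion, then assign
    rate_i = 0.5 * log2(var_i / min(theta, var_i)) bits to component i.
    """
    variances = np.asarray(variances, dtype=float)
    if np.any(variances <= 0):
        raise ValueError("variances must be positive")
    if not 0 < total_distortion <= variances.sum():
        raise ValueError("distortion budget must lie in (0, sum of variances]")

    # Bisection on the water level: the achieved distortion is
    # nondecreasing in theta, so a simple bracket search suffices.
    lo, hi = 0.0, variances.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, variances).sum() < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)

    # Components below the water level get distortion equal to their
    # variance, hence zero rate; the rest share the same distortion theta.
    distortions = np.minimum(theta, variances)
    rates = 0.5 * np.log2(variances / distortions)
    return theta, rates, distortions

# Illustrative values only: three components with unequal variances.
theta, rates, distortions = reverse_waterfill([4.0, 1.0, 0.25], 0.75)
print("water level:", theta)
print("per-component rates (bits):", rates)
```

In this sketch, high-variance components receive more bits and components at or below the water level receive none, which is the kind of unequal rate allocation across in-features that the abstract describes.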