QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads
This work addresses the need to deploy large generative models efficiently on resource-constrained hardware, offering a domain-specific solution for AI practitioners.
The authors address the problem of quantizing generative AI workloads such as LLMs and VLMs to low bit-widths (down to 3-bit) with minimal performance loss, reporting results within 6% of the unquantized model for LlaVa-v1.6 and outperforming recently published state-of-the-art techniques.
We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It can quantize models down to 3-bit resolution with minimal loss in performance. The quantization strategies in QuantX take hardware-specific constraints into account to achieve efficient dequantization during inference, enabling a flexible trade-off between runtime speed, memory requirement, and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LlaVa-v1.6 quantized down to 3 bits on multiple end-user tasks and outperforms recently published state-of-the-art quantization techniques. We further integrate one particular technique from QuantX into the popular Llama.cpp framework and show its feasibility in terms of runtime compared with the mainstream quantization techniques in Llama.cpp. Lastly, this manuscript provides insights into the LLM quantization process that motivated the range of recipes and options incorporated in QuantX.
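To make the notions of low-bit quantization and dequantization concrete, the following is a minimal sketch of generic group-wise asymmetric uniform quantization in plain Python. It is not the QuantX recipe, which the abstract does not specify; the function names (`quantize_group`, `dequantize_group`, `quantize_tensor`, `bits_per_weight`), the group size, and the assumption of 32 bits of per-group metadata are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, minimum)


def quantize_group(weights: List[float], bits: int = 3) -> Group:
    """Map one group of weights to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1  # 7 steps for 3-bit codes 0..7
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate weights; this is the step executed at inference."""
    return [c * scale + w_min for c in codes]


def quantize_tensor(weights: List[float], group_size: int = 4, bits: int = 3) -> List[Group]:
    """Quantize a flat weight list group by group (illustrative layout only)."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def bits_per_weight(group_size: int, bits: int, meta_bits: int = 32) -> float:
    """Effective storage cost: code bits plus amortized per-group metadata."""
    return bits + meta_bits / group_size
```

In this sketch, smaller groups track the local weight range more closely and so reduce reconstruction error, but they raise the effective storage cost: under the assumed 32 bits of metadata per group, `bits_per_weight(128, 3)` gives 3.25 bits per weight, while `bits_per_weight(32, 3)` gives 4.0. This is one simple instance of the trade-off between memory requirement and accuracy that the abstract refers to.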