LG CL PF | Jul 7, 2023

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

arXiv:2307.03738v1 | 9.87 citations | h-index: 41 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient and accurate quantized inference on LLMs for users with off-the-shelf CPUs, though it appears incremental as it builds on existing quantization methods.

The paper tackles the problem of generating efficient CPU kernels for quantized inference on large language models such as LLaMA and OPT, and reports performance and accuracy that compare favorably to the best existing open-source solution.

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.
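To make "quantized inference" concrete, here is a minimal sketch of the kind of computation such kernels perform: a dot product between a 4-bit, group-quantized weight row and a float activation vector, dequantized on the fly. It is an illustration only, not QIGen's generated code. The min-max quantization scheme, the nibble packing order, the group size, and all function names are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

BITS = 4
LEVELS = (1 << BITS) - 1  # largest 4-bit code (15)


def quantize_group(weights: Sequence[float]) -> Tuple[List[int], float, float]:
    """Min-max quantize one group to 4-bit codes; returns (codes, scale, minimum)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS if hi > lo else 1.0
    codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def pack_nibbles(codes: Sequence[int]) -> bytes:
    """Pack two 4-bit codes per byte, low nibble first."""
    assert len(codes) % 2 == 0, "need an even number of codes"
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def quantize_row(row: Sequence[float], group_size: int):
    """Quantize a weight row group by group (hypothetical storage layout)."""
    assert group_size % 2 == 0 and len(row) % group_size == 0
    packed, scales, mins = bytearray(), [], []
    for start in range(0, len(row), group_size):
        codes, scale, lo = quantize_group(row[start:start + group_size])
        packed += pack_nibbles(codes)
        scales.append(scale)
        mins.append(lo)
    return bytes(packed), scales, mins


def quantized_dot(packed: bytes, scales, mins, x: Sequence[float], group_size: int) -> float:
    """Dot product of a packed 4-bit row with x, using w ~= scale * code + minimum."""
    total = 0.0
    for g, (scale, lo) in enumerate(zip(scales, mins)):
        base = g * group_size
        code_acc = 0.0  # sum of code_i * x_i within the group
        x_sum = 0.0     # sum of x_i within the group
        for j in range(group_size):
            idx = base + j
            byte = packed[idx // 2]
            code = (byte >> 4) if idx & 1 else (byte & 0x0F)
            code_acc += code * x[idx]
            x_sum += x[idx]
        # Apply scale and offset once per group instead of once per weight.
        total += scale * code_acc + lo * x_sum
    return total
```

Factoring the scale and offset out of the inner loop keeps the per-element work to an unpack and a multiply-add, which is the part an optimized kernel would typically vectorize.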
