LG · CL · PF · Jul 7, 2023

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

arXiv:2307.03738v1 · 7 citations · h-index: 41 · Has Code
Originality: Incremental advance
AI Analysis

This work targets efficient, accurate quantized inference on LLMs for users with off-the-shelf CPUs. The contribution appears incremental because it builds on existing quantization methods.

The paper tackles the generation of efficient CPU kernels for quantized inference on large language models such as LLaMA and OPT, and reports performance and accuracy that compare favorably with the best existing open-source solution.

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.

Code Implementations: 1 repo
