CL · Jan 16, 2024

Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models

arXiv:2401.08294v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of serving transformer models efficiently for users who need a configurable inference engine, though it appears incremental relative to existing methods.

The authors address inefficient and inflexible inference for large language models with Inferflow, an inference engine that makes three contributions: it generalizes compositionally to new models through a modular framework of atomic build-blocks, it introduces 3.5-bit quantization as a tradeoff between 3-bit and 4-bit quantization, and it implements hybrid model partitioning for multi-GPU inference to better balance speed and throughput.

We present Inferflow, an efficient and highly configurable inference engine for large language models (LLMs). With Inferflow, users can serve most of the common transformer models by simply modifying some lines in corresponding configuration files, without writing a single line of source code. Compared with most existing inference engines, Inferflow has some key features. First, by implementing a modular framework of atomic build-blocks and technologies, Inferflow is compositionally generalizable to new models. Second, 3.5-bit quantization is introduced in Inferflow as a tradeoff between 3-bit and 4-bit quantization. Third, hybrid model partitioning for multi-GPU inference is introduced in Inferflow to better balance inference speed and throughput than the existing partition-by-layer and partition-by-tensor strategies.
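The abstract does not say how an average of 3.5 bits per weight is reached. One plausible way to get there, shown here only as a minimal sketch and not as Inferflow's actual scheme, is to quantize two adjacent weights jointly into a single 7-bit code. The sketch assumes 11 uniform levels per weight (11 × 11 = 121 joint codes, which fit in 2^7 = 128) and a per-block min/max range. The names `LEVELS`, `quantize_block`, and `dequantize_block` are hypothetical, and the storage cost of the per-block range is ignored.

```python
# Illustrative only: pairs of weights share one 7-bit code (3.5 bits each on average).
# This is an assumed scheme for explanation, not the format used by Inferflow.

LEVELS = 11  # assumption: 11 levels per weight, so 11 * 11 = 121 <= 128 joint codes


def quantize_pair(w0, w1, lo, hi):
    """Map two floats in [lo, hi] to one integer code in [0, 121)."""
    step = (hi - lo) / (LEVELS - 1)
    q0 = min(LEVELS - 1, max(0, round((w0 - lo) / step)))
    q1 = min(LEVELS - 1, max(0, round((w1 - lo) / step)))
    return q0 * LEVELS + q1


def dequantize_pair(code, lo, hi):
    """Recover the two approximate floats from one joint code."""
    step = (hi - lo) / (LEVELS - 1)
    q0, q1 = divmod(code, LEVELS)
    return lo + q0 * step, lo + q1 * step


def quantize_block(weights):
    """Quantize an even-length list of floats; returns (lo, hi, codes)."""
    if len(weights) % 2 != 0:
        raise ValueError("block length must be even")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        hi = lo + 1.0  # avoid a zero step size for constant blocks
    codes = [
        quantize_pair(weights[i], weights[i + 1], lo, hi)
        for i in range(0, len(weights), 2)
    ]
    return lo, hi, codes


def dequantize_block(lo, hi, codes):
    """Inverse of quantize_block, up to quantization error."""
    out = []
    for code in codes:
        out.extend(dequantize_pair(code, lo, hi))
    return out


if __name__ == "__main__":
    block = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]
    lo, hi, codes = quantize_block(block)
    print("codes:", codes)
    print("restored:", dequantize_block(lo, hi, codes))
    print("bits per weight:", 7 * len(codes) / len(block))
```

Under these assumptions each weight is reconstructed to within half a quantization step, and the payload costs 7 bits per pair, which sits between the 6 bits of two 3-bit weights and the 8 bits of two 4-bit weights.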

Code Implementations: 1 repo
