DC | AI | LG | Jan 2, 2025

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

OpenAI | UW
arXiv:2501.01005v2 | 222 citations | h-index: 27 | MLSys
AI Analysis

FlashInfer addresses the need for high-throughput, low-latency inference across diverse LLM applications. It represents an incremental improvement, achieved through optimized memory access and scheduling.

The paper tackles inefficient GPU attention kernels in large language model (LLM) inference serving by introducing FlashInfer, an attention engine that combines block-sparse KV-cache storage with customizable attention templates. Compared to state-of-the-art solutions, it reports a 29-69% reduction in inter-token latency, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for parallel generation.

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adapts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires a static configuration. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction relative to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
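
The abstract's idea of storing a paged KV-cache in a block-sparse layout can be illustrated with a minimal sketch. It assumes a CSR-style pair of arrays (`indptr`, `indices`) that records which physical pages each request owns. The function names, the page table, and the page size below are hypothetical and chosen for illustration only; they are not FlashInfer's API or its actual data layout.

```python
from typing import List, Tuple


def page_tables_to_bsr(page_tables: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten per-request page lists into CSR-style (indptr, indices) arrays.

    Request i owns the pages indices[indptr[i]:indptr[i + 1]].
    (Hypothetical helper, not part of any real library.)
    """
    indptr = [0]
    indices: List[int] = []
    for pages in page_tables:
        indices.extend(pages)
        indptr.append(len(indices))
    return indptr, indices


def gather_kv_slots(
    indptr: List[int],
    indices: List[int],
    request_id: int,
    seq_len: int,
    page_size: int,
) -> List[int]:
    """Map each logical token position of one request to a physical cache slot."""
    pages = indices[indptr[request_id]:indptr[request_id + 1]]
    if seq_len > len(pages) * page_size:
        raise ValueError("sequence does not fit in the pages assigned to this request")
    # Token t lives in the (t // page_size)-th page of the request,
    # at offset (t % page_size) inside that page.
    return [pages[t // page_size] * page_size + t % page_size for t in range(seq_len)]


# Three requests whose pages are scattered across the physical cache (assumed data).
page_tables = [[3, 0], [5], [1, 4, 2]]
indptr, indices = page_tables_to_bsr(page_tables)
slots = gather_kv_slots(indptr, indices, request_id=2, seq_len=5, page_size=2)
print(indptr, indices, slots)
```

In this sketch, requests of different lengths share one flat index structure, so a kernel can look up a request's pages with two array reads instead of walking a per-request table. This is one plausible reading of how a block-sparse format unifies heterogeneous KV-cache storage.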

