DC AI LG | Jan 2, 2025

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze

OpenAI, UW

arXiv:2501.01005v234.7240 citations | h-index: 27 | Has Code | MLSys

Originality: Incremental advance

AI Analysis

FlashInfer addresses the need for high-throughput, low-latency inference across diverse LLM applications. It is an incremental improvement, achieved through optimized memory access and scheduling.

The paper tackles inefficient GPU attention kernels in large language model (LLM) inference serving by introducing FlashInfer, an attention engine that combines block-sparse KV-cache storage with customizable attention templates. Compared to state-of-the-art solutions, it reports a 29-69% reduction in inter-token latency, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for parallel generation.

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction relative to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
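
To make the block-sparse KV-cache idea more concrete, here is a minimal sketch of how a paged KV cache can be indexed with a CSR-like (block-sparse row) structure. It assumes fixed-size pages and one row per request. The names `PagedKVIndex`, `build_index`, `token_location`, and `seq_len` are hypothetical and chosen for illustration only; this is not FlashInfer's actual API or implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PagedKVIndex:
    """CSR-like index over fixed-size KV pages (illustrative layout only)."""
    page_size: int
    indptr: List[int]         # request i owns indices[indptr[i]:indptr[i + 1]]
    indices: List[int]        # physical page ids, concatenated over requests
    last_page_len: List[int]  # valid tokens in each request's final page


def build_index(seq_lens: List[int], page_size: int,
                free_pages: List[int]) -> PagedKVIndex:
    """Assign physical pages to each request, in order, from free_pages."""
    indptr, indices, last_page_len = [0], [], []
    cursor = 0
    for n in seq_lens:
        if n < 1:
            raise ValueError("each request needs at least one token")
        num_pages = -(-n // page_size)  # ceiling division
        if cursor + num_pages > len(free_pages):
            raise ValueError("not enough free pages")
        indices.extend(free_pages[cursor:cursor + num_pages])
        cursor += num_pages
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return PagedKVIndex(page_size, indptr, indices, last_page_len)


def seq_len(index: PagedKVIndex, req: int) -> int:
    """Recover a request's token count from the index alone."""
    num_pages = index.indptr[req + 1] - index.indptr[req]
    return (num_pages - 1) * index.page_size + index.last_page_len[req]


def token_location(index: PagedKVIndex, req: int, pos: int) -> Tuple[int, int]:
    """Map (request, token position) to (physical page id, offset in page)."""
    if not 0 <= pos < seq_len(index, req):
        raise IndexError("token position out of range")
    page_slot = index.indptr[req] + pos // index.page_size
    return index.indices[page_slot], pos % index.page_size


# Three requests of different lengths share one pool of 4-token pages.
index = build_index(seq_lens=[5, 3, 9], page_size=4,
                    free_pages=[7, 2, 9, 4, 0, 5, 1])
```

In this sketch, requests of different lengths share a single page pool, and an attention kernel would only need `indptr`, `indices`, and `last_page_len` to gather the right keys and values, regardless of where the pages physically live.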
