LG · DC · Oct 23, 2024

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

arXiv:2410.18038v2 · 72 citations · h-index: 47 · ASPLOS
Originality: Incremental advance
AI Analysis

The paper addresses a GPU utilization bottleneck in LLM inference systems and offers incremental improvements in throughput and latency.

The paper tackles inefficient attention computation for hybrid batches in LLM inference by introducing POD-Attention, a GPU kernel that runs prefill and decode operations concurrently and speeds up attention computation by up to 59% (mean 28%).

Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for attention computation because existing attention kernels specialize execution independently for the prefill and decode phases. In this paper, we present POD-Attention, the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. POD-Attention speeds up attention computation by up to 59% (mean 28%), enabling higher throughput and lower latency LLM inference compared to the use of independently optimized prefill and decode attention kernels.
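
The compute-bound versus memory-bound split described above can be illustrated with a back-of-the-envelope roofline model. The sketch below is not the paper's kernel or its cost model: the per-head cost formula ignores causal masking and KV-head sharing, the hardware peaks are hypothetical round numbers, the batch shape is made up, and all names (`Work`, `attention_work`, `overlapped_time`, and so on) are invented for illustration. It only shows why running the two phases back to back leaves resources idle, while an idealized overlap is bounded by whichever resource saturates first.

```python
from dataclasses import dataclass

BYTES_PER_ELEM = 2  # assumption: fp16/bf16 storage


@dataclass(frozen=True)
class Work:
    flops: float
    bytes_moved: float

    @property
    def intensity(self) -> float:
        """Arithmetic intensity in FLOPs per byte."""
        return self.flops / self.bytes_moved


def attention_work(q_len: int, kv_len: int, head_dim: int) -> Work:
    """Rough cost of one attention head: QK^T plus PV, no causal mask."""
    # Two matmuls, each q_len * kv_len * head_dim multiply-adds (2 FLOPs each).
    flops = 4.0 * q_len * kv_len * head_dim
    # Read Q and write O (q_len rows each), read K and V (kv_len rows each).
    elems = 2 * q_len * head_dim + 2 * kv_len * head_dim
    return Work(flops, elems * BYTES_PER_ELEM)


def roofline_time(work: Work, peak_flops: float, peak_bw: float) -> float:
    """Time if the slower of compute and memory traffic is the only limit."""
    return max(work.flops / peak_flops, work.bytes_moved / peak_bw)


def serial_time(items, peak_flops: float, peak_bw: float) -> float:
    """Kernels run one after another; each pays its own bottleneck."""
    return sum(roofline_time(w, peak_flops, peak_bw) for w in items)


def overlapped_time(items, peak_flops: float, peak_bw: float) -> float:
    """Idealized perfect overlap: only total FLOPs and total bytes matter."""
    total = Work(sum(w.flops for w in items), sum(w.bytes_moved for w in items))
    return roofline_time(total, peak_flops, peak_bw)


# Hypothetical accelerator, not a measured device.
PEAK_FLOPS = 300e12  # FLOP/s
PEAK_BW = 2e12       # bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # intensity at which compute and memory balance

# Made-up hybrid batch: one prefill head plus 64 decode heads.
prefill = attention_work(q_len=2048, kv_len=2048, head_dim=128)
decodes = [attention_work(q_len=1, kv_len=4096, head_dim=128) for _ in range(64)]
batch = [prefill] + decodes

t_serial = serial_time(batch, PEAK_FLOPS, PEAK_BW)
t_overlap = overlapped_time(batch, PEAK_FLOPS, PEAK_BW)

print(f"ridge point:       {ridge:.0f} FLOPs/byte")
print(f"prefill intensity: {prefill.intensity:.1f} FLOPs/byte")
print(f"decode intensity:  {decodes[0].intensity:.2f} FLOPs/byte")
print(f"serial / overlapped time ratio: {t_serial / t_overlap:.2f}")
```

Under these assumed numbers the prefill head sits well above the ridge point (compute-bound) and each decode head well below it (memory-bound), which is the asymmetry the abstract exploits. The overlapped figure is an upper bound on what any fused kernel could achieve in this toy model; it says nothing about the measured 59% and 28% speedups, which depend on real scheduling and resource-allocation effects.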

Code Implementations: 1 repo