DC · AI · LG · Nov 4, 2024

Context Parallelism for Scalable Million-Token Inference

arXiv:2411.01783v3 · 35 citations · h-index: 15 · MLSys
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficient million-token inference for AI applications and represents an incremental improvement in parallelism techniques.

The paper tackles the problem of scaling long-context inference for large language models by introducing context parallelism. It achieves near-linear scaling of prefill latency with up to 128 H100 GPUs; for example, a 1M-token context prefill with the Llama3 405B model completes in 77 seconds.

We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants, pass-KV and pass-Q, to cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and TCP both show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
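To make "lossless exact ring attention" concrete, the following is a minimal single-process NumPy sketch of the pass-KV idea, not the paper's implementation. It assumes each simulated rank owns a contiguous shard of queries, keys, and values; KV shards then rotate around the ring while every rank folds each visiting block into a running softmax. The function names, the contiguous sharding, and the online-softmax merge are assumptions made for illustration; real communication and load balancing across ranks are not modeled.

```python
import numpy as np

def full_attention(q, k, v, causal=True):
    """Reference: softmax attention over the whole sequence on one device."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        t = q.shape[0]
        scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def ring_attention_pass_kv(q, k, v, n_ranks, causal=True):
    """Simulated pass-KV ring: Q stays put, KV shards visit every rank once."""
    t, d = q.shape
    assert t % n_ranks == 0, "sequence length must divide evenly across ranks"
    shard = t // n_ranks
    pos = np.arange(t).reshape(n_ranks, shard)      # global token positions
    q_sh = q.reshape(n_ranks, shard, d)
    k_sh = k.reshape(n_ranks, shard, d)
    v_sh = v.reshape(n_ranks, shard, d)

    # Per-rank running softmax state: row max, normalizer, weighted value sum.
    m = np.full((n_ranks, shard, 1), -np.inf)
    l = np.zeros((n_ranks, shard, 1))
    acc = np.zeros((n_ranks, shard, d))

    for step in range(n_ranks):
        for rank in range(n_ranks):
            # KV block held by `rank` at this step (its own block at step 0).
            src = (rank - step) % n_ranks
            s = q_sh[rank] @ k_sh[src].T / np.sqrt(d)
            if causal:
                visible = pos[rank][:, None] >= pos[src][None, :]
                s = np.where(visible, s, -np.inf)
            # Online-softmax merge: rescale old state to the new row max.
            new_m = np.maximum(m[rank], s.max(axis=-1, keepdims=True))
            scale = np.exp(m[rank] - new_m)
            p = np.exp(s - new_m)
            l[rank] = l[rank] * scale + p.sum(axis=-1, keepdims=True)
            acc[rank] = acc[rank] * scale + p @ v_sh[src]
            m[rank] = new_m

    return (acc / l).reshape(t, d)

# Sanity check on random data: the ring result matches the reference.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(ring_attention_pass_kv(q, k, v, n_ranks=4), full_attention(q, k, v))
```

Because the merge only rescales partial sums by the running maximum, the result agrees with single-device attention up to floating-point rounding, which is the sense in which such a scheme is exact. A pass-Q variant would instead circulate the query shards and return the partial outputs to their owners for the same kind of merge.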

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
