How Much Parallelism Is "Free"? A Principle of Near-Free Parallelism for Parallel Decoding
For researchers and engineers working on efficient parallel decoding in large language models and diffusion models, this work provides a system-side budget for parallelism selection and model-system co-design.
This paper investigates the system-side cost of parallel decoding, introducing Near-Free Parallelism (NFP) to quantify the maximum number of positions that can be processed at near-free latency. It finds that NFP is determined by both memory-bound resource slack and implementation-induced kernel-granularity slack, and derives a principle that accurately predicts NFP boundaries. Against those boundaries, the traditional idle-compute intuition can over-predict by up to 23x.
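As a minimal sketch of how the NFP definition could be applied to measurements, the Python snippet below reads an NFP value off a latency profile. The function name `near_free_parallelism`, the 10% `tolerance` that stands in for "near-free", and the latency numbers are all illustrative assumptions, not values or code from the paper.

```python
def near_free_parallelism(latency_by_positions, tolerance=0.10):
    """Return the largest measured position count whose latency stays within
    (1 + tolerance) of the single-position latency, requiring every smaller
    measured count to stay within that budget as well.

    `tolerance` is an assumed threshold for "near-free", not the paper's.
    """
    budget = latency_by_positions[1] * (1.0 + tolerance)
    nfp = 1
    for n in sorted(latency_by_positions):
        if latency_by_positions[n] > budget:
            break  # first count that exceeds the near-free budget
        nfp = n
    return nfp


# Hypothetical per-forward latencies in milliseconds, keyed by position count.
measured_ms = {1: 1.00, 2: 1.01, 4: 1.02, 8: 1.05, 16: 1.09, 32: 1.40, 64: 2.30}
print(near_free_parallelism(measured_ms))  # 16
```

With these made-up numbers, 16 positions still fit inside the 10% budget while 32 do not, so the measured boundary is 16.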
Parallel decoding improves generation efficiency by processing multiple decode positions within a single decode forward pass, but reported speedups conflate algorithmic token utilization with the system cost of executing multiple positions. We isolate the system side by introducing Near-Free Parallelism (NFP), the maximum number of positions executable at near-free latency. Analyzing Dense FFNs, MoE FFNs, and Attention against an idle-compute baseline, we find that NFP is shaped not by memory-bound resource slack alone, but also by implementation-induced kernel-granularity slack. Based on these mechanisms, we establish a Near-Free Parallelism principle that predicts the NFP boundary from hardware balance and implementation granularity. Validation on representative Dense and MoE models, spanning both diffusion and autoregressive decoding, shows that the principle accurately predicts practical NFP boundaries and reveals that the standard idle-compute intuition can over-predict by up to 23x. The principle thus offers a system-side budget for parallelism selection and model-system co-design.
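To make the two ingredients named in the abstract concrete, the following Python sketch computes a roofline-style idle-compute estimate from hardware balance and shows how tile padding creates granularity slack. It is not the paper's principle, which the abstract does not spell out. It assumes a weight-bound matrix multiply where each parameter is read once per forward pass and contributes 2 FLOPs per position. The function names, the hardware figures (1 PFLOP/s, 2 TB/s), and the tile size of 16 are hypothetical.

```python
import math


def idle_compute_bound(peak_flops, mem_bandwidth, bytes_per_param):
    """Position count at which compute time equals weight-loading time.

    Assumed model: memory time = params * bytes_per_param / mem_bandwidth,
    compute time = 2 * params * n / peak_flops. Setting them equal gives
    n = (peak_flops / mem_bandwidth) * bytes_per_param / 2.
    """
    balance = peak_flops / mem_bandwidth  # hardware balance in FLOPs per byte
    return balance * bytes_per_param / 2.0


def padded_positions(n_positions, tile):
    """Positions a tiled kernel actually processes after rounding up to a
    whole number of tiles (an assumed model of kernel granularity)."""
    return math.ceil(n_positions / tile) * tile


# Hypothetical accelerator: 1 PFLOP/s peak, 2 TB/s memory bandwidth, 16-bit weights.
print(idle_compute_bound(1.0e15, 2.0e12, 2))  # 500.0
# Hypothetical tile of 16 positions: 5 positions are padded up to 16.
print(padded_positions(5, 16))  # 16
```

Under these assumptions the idle-compute intuition would suggest roughly 500 positions are free, which is the kind of estimate the abstract reports as over-predicting the practical boundary. The padding function shows the other direction of slack: within one tile, extra positions occupy work the kernel would have performed anyway.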