FOCUS: DLLMs Know How to Tame Their Compute Bound
This work addresses a key deployment bottleneck for DLLMs and enables scalable throughput. The contribution is incremental: it optimizes an existing method rather than introducing a new paradigm.
The paper tackles the high decoding cost of Diffusion Large Language Models (DLLMs). It identifies that most compute is wasted on non-decodable tokens and proposes FOCUS, an inference system that dynamically focuses computation on decodable tokens. FOCUS achieves up to 3.52x throughput improvement over LMDeploy while preserving generation quality.
Diffusion Large Language Models (DLLMs) offer a compelling alternative to autoregressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.
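
The eviction idea described in the abstract can be illustrated with a minimal sketch. This is not the FOCUS implementation. It assumes that each token in a block already has an attention-derived importance score, that the highest-scoring tokens are the ones kept, and that the engine has a fixed per-step token budget. The function names `select_tokens` and `sequences_per_step`, the example scores, and the budget of 64 tokens are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def select_tokens(importance: List[float], keep: int) -> Tuple[List[int], List[int]]:
    """Split block positions into (kept, evicted) by descending importance.

    Assumption: a higher importance score means the token is more likely
    to be decodable at this step, so low scorers are evicted.
    """
    if not 0 <= keep <= len(importance):
        raise ValueError("keep must be between 0 and the block size")
    # Rank positions from most to least important (ties keep original order).
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])      # positions that stay in the computation
    evicted = sorted(ranked[keep:])   # positions dropped for this step
    return kept, evicted


def sequences_per_step(token_budget: int, tokens_per_sequence: int) -> int:
    """How many sequences fit in one step under a fixed token budget."""
    return token_budget // tokens_per_sequence


# Hypothetical importance scores for one block of 8 tokens.
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.08, 0.03, 0.02]
kept, evicted = select_tokens(scores, keep=2)

# With a hypothetical budget of 64 tokens per step, computing only the
# kept tokens lets more sequences share the same step.
full_block = sequences_per_step(64, len(scores))
after_eviction = sequences_per_step(64, len(kept))
print(kept, evicted, full_block, after_eviction)
```

Under these assumptions, keeping 2 of 8 tokens per block raises the number of sequences that fit in one step from 8 to 32. That is the sense in which evicting non-decodable tokens increases the effective batch size. The actual selection rule and the resulting speedup in FOCUS depend on details not given in the abstract.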