Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao

arXiv:2601.20706

Originality: Incremental advance

AI Analysis

This addresses a performance bottleneck in dLLM inference for AI hardware designers, offering a domain-specific incremental improvement.

The paper tackles the inefficiency of diffusion large language model (dLLM) sampling on conventional NPUs. GPU profiling shows that sampling can account for up to 70% of inference latency because of its memory-intensive operations. The proposed NPU architecture combines lightweight non-GEMM vector primitives, in-place memory reuse, and a decoupled mixed-precision memory hierarchy, achieving up to a 2.53x speedup over the NVIDIA RTX A6000 GPU.

Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent nm technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
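To make the three sampling costs named in the abstract concrete (vocabulary-wide logits, reduction-based token selection, iterative masked updates), here is a minimal pure-Python sketch of one confidence-based unmasking step. It is an illustration under assumptions, not the paper's implementation: the abstract does not specify the sampling algorithm, and the names `MASK`, `softmax_max`, and `sampling_step` are hypothetical.

```python
import math

MASK = -1  # hypothetical mask token id, used only in this sketch


def softmax_max(logits_row):
    """Return (argmax index, softmax probability of the argmax) for one position.

    This is the vocabulary-wide reduction: every logit in the row is read.
    """
    m = max(logits_row)
    denom = sum(math.exp(x - m) for x in logits_row)
    best = max(range(len(logits_row)), key=lambda i: logits_row[i])
    return best, 1.0 / denom  # exp(m - m) / denom


def sampling_step(tokens, logits, k):
    """Unmask the k most confident masked positions, updating tokens in place.

    tokens: list of token ids; MASK marks positions still to be generated.
    logits: one vocabulary-sized row of scores per position.
    """
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            best, conf = softmax_max(logits[pos])
            candidates.append((conf, pos, best))
    # Top-k selection over the per-position confidences.
    candidates.sort(key=lambda c: c[0], reverse=True)
    for conf, pos, best in candidates[:k]:
        tokens[pos] = best  # masked update that reuses the same buffer
    return tokens


tokens = [5, MASK, MASK, 7]
logits = [
    [0.0, 0.0, 0.0],
    [0.1, 2.0, 0.3],  # peaked row: high confidence
    [1.0, 1.1, 0.9],  # flat row: low confidence
    [0.0, 0.0, 0.0],
]
print(sampling_step(tokens, logits, k=1))  # [5, 1, -1, 7]
```

Even in this toy form, each step reads a full row of logits for every masked position and writes back only a few token ids. That imbalance between data read and data written is the kind of memory-bound, non-GEMM work the abstract describes.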
