LG · AI · CL · May 17, 2025

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

arXiv:2506.06295v1 · 129 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses a critical bottleneck for dLLM users by bringing inference latency close to that of autoregressive models under many settings. It is an incremental advance, since it adapts existing caching ideas to a new model type.

The paper tackles the high inference latency of diffusion-based Large Language Models (dLLMs) by proposing dLLM-Cache, an adaptive caching framework that achieves up to 9.1x speedup without compromising output quality.

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs because of their bidirectional attention mechanism. To address this challenge, our work begins with the key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Code is provided in the supplementary material and will be released publicly on GitHub.
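
The caching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters `prompt_interval`, `response_interval`, and `update_ratio`, and the use of cosine similarity over plain feature vectors are all assumptions made for illustration. The sketch shows two pieces: a schedule that refreshes the prompt cache only at long intervals, and a selector that picks the response tokens whose features have drifted most from their cached values, so that only those are recomputed.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 1.0

def select_tokens_to_update(cached, current, update_ratio):
    """Return sorted indices of the response tokens that changed the most.

    `cached` and `current` are lists of per-token feature vectors.
    `update_ratio` (hypothetical parameter) is the fraction of tokens to recompute.
    """
    n_update = math.ceil(update_ratio * len(current))
    sims = [cosine_similarity(c, x) for c, x in zip(cached, current)]
    # Lowest similarity first: those tokens drifted most since they were cached.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(ranked[:n_update])

def plan_step(step, prompt_interval, response_interval):
    """Decide which caches to fully refresh at a denoising step (assumed schedule)."""
    return {
        "refresh_prompt": step % prompt_interval == 0,
        "refresh_response": step % response_interval == 0,
    }

# Toy features for four response tokens; only token 1 changes direction sharply.
cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
current = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0], [0.9, 0.0]]
changed = select_tokens_to_update(cached, current, update_ratio=0.25)
print(changed)  # [1]
```

On steps where `plan_step` reports no full refresh, a caller would recompute only the tokens returned by `select_tokens_to_update` and reuse cached features for the rest. Choosing a large `prompt_interval` reflects the observation that the prompt is static, while the smaller response budget reflects that most response tokens remain stable between adjacent steps.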

Code Implementations: 1 repo
