CL LG Apr 18

Stability-Weighted Decoding for Diffusion Language Models

arXiv:2604.17068 17.7 h-index: 4
AI Analysis

For practitioners using diffusion language models, SWD offers a simple, plug-and-play method to enhance decoding quality without retraining.

The paper introduces Stability-Weighted Decoding (SWD), a training-free decoding strategy for diffusion language models that incorporates temporal stability of tokens to improve generation accuracy. SWD consistently outperforms standard baselines on code generation and mathematical reasoning benchmarks across various scoring metrics and acceleration ratios.

Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token's temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary score-based decoding policies. Experiments on code generation and mathematical reasoning benchmarks demonstrate that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and exhibits exceptional robustness, maintaining a significant performance lead over standard baselines across varying acceleration ratios.
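As a rough illustration of the idea in the abstract, the sketch below modulates a static confidence score by a temporal-stability term. It is not the paper's implementation: the abstract does not give the exact weighting, so the exponential form exp(-lam * KL), the KL direction (current step against previous step), the parameter `lam`, and all function names here are assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def stability_weighted_scores(prev_probs, curr_probs, masked, lam=1.0):
    """Score each masked position as (static confidence) x (stability weight).

    Static confidence: max probability at the current denoising step.
    Stability weight (assumed form): exp(-lam * KL(current || previous)),
    so tokens whose predictions changed a lot between steps are down-weighted.
    """
    scores = {}
    for pos in masked:
        confidence = max(curr_probs[pos])
        instability = kl_divergence(curr_probs[pos], prev_probs[pos])
        scores[pos] = confidence * math.exp(-lam * instability)
    return scores

def select_to_unmask(scores, k):
    """Pick the k masked positions with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example with a two-token vocabulary and two masked positions.
# Position 0 is confident now (0.9) but flipped its prediction since the last step.
# Position 1 is slightly less confident (0.8) but unchanged between steps.
prev_probs = {0: [0.1, 0.9], 1: [0.8, 0.2]}
curr_probs = {0: [0.9, 0.1], 1: [0.8, 0.2]}

scores = stability_weighted_scores(prev_probs, curr_probs, masked=[0, 1])
chosen = select_to_unmask(scores, k=1)
```

In this toy case a purely static confidence rule would unmask position 0 first, while the stability-weighted score prefers position 1, whose prediction did not change between steps. The base score could be swapped for another metric (for example margin or negative entropy) without changing the modulator, which is the sense in which the abstract describes SWD as applicable to arbitrary score-based policies.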
