LG · Jul 11, 2025

Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

arXiv:2507.08390v2 · 16 citations · h-index: 14

Originality: Highly original

AI Analysis

This paper addresses the challenge of inference-time control for diffusion language models. It offers a novel method for reward optimization without model retraining, though its contribution is an incremental improvement over existing techniques.

The paper tackles the problem of steering generation in diffusion language models toward desired rewards without retraining, and introduces PG-DLM, an inference-time algorithm that outperforms prior methods in reward-guided tasks like toxicity and sentiment control while preserving perplexity.

Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity under reward optimization. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. We derive theoretical guarantees for convergence, including asymptotic consistency and variance bounds. Within this framework, we further analyze trade-offs across four key axes for inference-time scaling under fixed budgets: iterations, samples, denoising steps, and reward estimation. Our analysis shows scaling iterations achieves the best reward-perplexity trade-off. Empirically, PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B as base models across a wide range of compute budgets for reward-guided generation tasks including toxicity and sentiment control as well as linguistic acceptability.
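To make the "Markov chain over full denoising trajectories" concrete, here is a minimal toy sketch of particle Gibbs with a conditional sequential Monte Carlo sweep. It is not the authors' implementation, and every name in it (`denoise_step`, `reward`, `conditional_smc`, `particle_gibbs`) is hypothetical. The assumptions: the base model is replaced by a uniform token sampler, positions are unmasked left to right (one per step), the reward simply counts occurrences of token 0, and the reward is evaluated exactly on partial sequences rather than estimated.

```python
import math
import random

MASK = -1    # placeholder for a not-yet-denoised position
VOCAB = 4    # toy vocabulary {0, 1, 2, 3}
LENGTH = 8   # sequence length; one position is unmasked per step


def reward(seq):
    """Toy reward (assumption): number of positions holding token 0."""
    return sum(1 for tok in seq if tok == 0)


def denoise_step(seq, t, rng):
    """Stand-in for the base model: fill position t with a uniform token."""
    out = list(seq)
    out[t] = rng.randrange(VOCAB)
    return tuple(out)


def conditional_smc(reference, num_particles, beta, rng):
    """One SMC sweep over trajectories; particle 0 is pinned to `reference`.

    With reference=None this reduces to plain (unconditional) SMC.
    """
    start = (MASK,) * LENGTH
    paths = [[start] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for t in range(LENGTH):
        new_paths = []
        for k in range(num_particles):
            if k == 0 and reference is not None:
                # The retained trajectory is copied, never resampled.
                new_paths.append(reference[: t + 2])
            else:
                # Multinomial resampling of an ancestor, then propagation.
                parent = rng.choices(paths, weights=weights)[0]
                new_paths.append(parent + [denoise_step(parent[-1], t, rng)])
        paths = new_paths
        # Incremental potentials; their product over all steps telescopes
        # to exp(beta * reward(final sequence)).
        weights = [math.exp(beta * (reward(p[-1]) - reward(p[-2])))
                   for p in paths]
    # Draw the next reference trajectory in proportion to the final weights.
    return rng.choices(paths, weights=weights)[0]


def particle_gibbs(num_iterations, num_particles, beta, seed=0):
    """Iterate conditional SMC sweeps; return the final denoised sequence."""
    rng = random.Random(seed)
    reference = None  # the first sweep has no trajectory to condition on
    for _ in range(num_iterations):
        reference = conditional_smc(reference, num_particles, beta, rng)
    return reference[-1]


if __name__ == "__main__":
    sample = particle_gibbs(num_iterations=4, num_particles=8, beta=3.0)
    print(sample, reward(sample))
```

In this sketch, `num_iterations`, `num_particles`, and `LENGTH` loosely correspond to three of the scaling axes named in the abstract (iterations, samples, denoising steps). The fourth axis, reward estimation, is sidestepped here because the toy reward is computed exactly at every step.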

