LG | AI | Oct 26, 2025

Aligning Diffusion Language Models via Unpaired Preference Optimization

arXiv:2510.23658v21 citations | h-index: 3
Originality: Incremental advance
AI Analysis

For AI researchers and practitioners, this paper addresses the problem of aligning diffusion language models by offering a more data-efficient alternative to pairwise methods. The advance is incremental, since it builds on existing preference optimization techniques.

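The paper tackles the challenge of aligning diffusion language models to human preferences without costly pairwise data. It introduces ELBO-KTO, which combines an ELBO surrogate for the intractable sequence log-likelihood with an unpaired preference objective (KTO). Applied to LLaDA-8B-Instruct, the method reaches adjusted win rates of 65.9% on kto-mix-14k and 62.3% on UltraFeedback-Binary against the base model under an automatic LLM judge, and it performs on par with or better than the base model on downstream tasks such as GSM8K and MMLU.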

Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
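To make the combination described in the abstract more concrete, here is a minimal sketch of an unpaired, KTO-style loss in which ELBO estimates stand in for the intractable sequence log-likelihoods. It assumes the standard KTO form (a sigmoid value function around a reference point) and is not necessarily the paper's exact objective. The function name `elbo_kto_loss`, the parameters `beta`, `z0`, `lambda_d`, `lambda_u`, and the idea that the ELBO values are precomputed per example are all illustrative assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def elbo_kto_loss(elbo_policy, elbo_ref, desirable,
                  beta=0.1, z0=0.0, lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss over unpaired examples (hypothetical sketch).

    elbo_policy, elbo_ref: per-example ELBO estimates of log p(y | x) under
        the trained model and the frozen reference model (assumed to be
        precomputed, e.g. by Monte Carlo over masking levels).
    desirable: per-example booleans, True for a "good" response.
    z0: reference point (in KTO, an estimate of the policy/reference KL);
        treated here as a given constant.
    """
    losses = []
    for ep, er, good in zip(elbo_policy, elbo_ref, desirable):
        # ELBO difference used as a surrogate for the implicit reward
        # log pi_theta(y | x) - log pi_ref(y | x).
        r = ep - er
        if good:
            value = lambda_d * sigmoid(beta * (r - z0))
            losses.append(lambda_d - value)
        else:
            value = lambda_u * sigmoid(beta * (z0 - r))
            losses.append(lambda_u - value)
    return sum(losses) / len(losses)


# Each example carries only a binary label; no (chosen, rejected) pair is needed.
baseline = elbo_kto_loss([-10.0, -12.0], [-10.0, -12.0], [True, False])
improved = elbo_kto_loss([-5.0, -12.0], [-10.0, -12.0], [True, False])
```

In this sketch, raising the policy's ELBO on a desirable response relative to the reference lowers the loss, and the same holds for lowering it on an undesirable response. Because each ELBO is a noisy lower bound rather than the true log-likelihood, the surrogate reward `r` inherits bias and variance, which is the issue the abstract says the authors analyze and mitigate with variance-reduction practices.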

