CL · May 31

Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

arXiv:2606.00997 · 24.6
AI Analysis

For researchers working on discrete diffusion models and order-agnostic generation, this work highlights a fundamental inconsistency in OALM likelihoods and provides a practical diagnostic for comparing decoding paths.

Order-agnostic language models (OALMs) produce conditionals that are not exact factorizations of a coherent joint distribution, with reveal order shifting log-likelihood by up to 0.49 nats/token. The paper proposes a variance-based diagnostic for decoding paths and shows that low variance is associated with better downstream performance.
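To make the reveal-order effect concrete, here is a minimal sketch with a two-token toy target. The table `TARGET_PROB` and the helpers `path_log_likelihood` and `per_token_gap` are hypothetical names introduced for illustration only; the probabilities are made up and are deliberately not derived from any single joint distribution, and none of this is the paper's evaluation code.

```python
import math

# Hypothetical toy "model": probability assigned to the target token at
# position `pos`, given the set of positions already revealed.
# The values are invented and intentionally inconsistent with any joint.
TARGET_PROB = {
    (0, frozenset()): 0.50,
    (1, frozenset()): 0.40,
    (0, frozenset({1})): 0.90,
    (1, frozenset({0})): 0.60,
}


def path_log_likelihood(order):
    """Sum the target log-probabilities along one reveal order (in nats)."""
    revealed = set()
    total = 0.0
    for pos in order:
        total += math.log(TARGET_PROB[(pos, frozenset(revealed))])
        revealed.add(pos)
    return total


def per_token_gap(order_a, order_b):
    """Absolute log-likelihood difference between two orders, per token."""
    diff = path_log_likelihood(order_a) - path_log_likelihood(order_b)
    return abs(diff) / len(order_a)


l2r = path_log_likelihood((0, 1))  # log 0.50 + log 0.60
r2l = path_log_likelihood((1, 0))  # log 0.40 + log 0.90
gap = per_token_gap((0, 1), (1, 0))
print(f"L2R: {l2r:.4f}  R2L: {r2l:.4f}  gap: {gap:.4f} nats/token")
```

If the four conditionals came from one coherent joint, the chain rule would force both orders to give the same total, so the gap would be zero. A nonzero gap on the same target is the kind of path-dependent artifact the summary refers to.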

Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. In LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the resulting deviation motivates $\mathrm{Var}(\log q_t)$ as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.
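The joint reporting of mean confidence and confidence variance can be sketched as follows. This assumes that $q_t$ is the probability the model assigns to the token revealed at step $t$, and that the variance is the population variance over the steps of one path; the abstract fixes neither detail. The function `trace_stats` and the two example traces are hypothetical.

```python
import math
from statistics import fmean, pvariance


def trace_stats(confidences):
    """Return (mean, population variance) of log q_t along one decoding path.

    `confidences` holds the per-step probabilities q_t, each in (0, 1].
    """
    log_q = [math.log(q) for q in confidences]
    return fmean(log_q), pvariance(log_q)


# Two made-up traces with the same total log-likelihood (4 * log 0.5):
uniform = [0.5, 0.5, 0.5, 0.5]    # confidence spread evenly over steps
spiky = [1.0, 1.0, 0.25, 0.25]    # same total, concentrated in two steps

mean_u, var_u = trace_stats(uniform)
mean_s, var_s = trace_stats(spiky)
print(f"uniform: mean={mean_u:.4f} var={var_u:.4f}")
print(f"spiky:   mean={mean_s:.4f} var={var_s:.4f}")
```

The two traces have the same mean log-confidence, so likelihood alone cannot tell them apart; only the variance separates the evenly spread path from the concentrated one, which is why the two numbers are reported together.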
