Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models
This work addresses a fundamental limitation of language generation for tasks with structured outputs and offers a more flexible approach, although its contribution to model robustness is incremental.
The paper tackles the problem that autoregressive (AR) language models struggle when the required output order conflicts with the natural reasoning order, for example when an answer must precede its explanation. It shows that masked diffusion language models keep accuracy stable under answer-first ordering (at most a 14% relative drop), whereas AR models show large gaps (up to a 67% relative drop).
Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to an answer before generating the intermediate reasoning that supports it. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable (≤14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), so that reasoning tokens settle before the answer is committed. Finally, we identify failure conditions where this advantage weakens, delineating the limits of order robustness.
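The two quantities the abstract relies on, the relative accuracy drop and the point at which a token stabilizes during diffusion, can be illustrated with a minimal Python sketch. This is not the paper's evaluation code: the function names `relative_drop` and `stabilization_step`, the definition of stabilization as "the first step from which the prediction never changes again", and all numbers and traces below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def relative_drop(acc_reasoning_first: float, acc_answer_first: float) -> float:
    """Relative accuracy drop when moving from reasoning-first
    (chain-of-thought) prompts to answer-first prompts."""
    if acc_reasoning_first <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_reasoning_first - acc_answer_first) / acc_reasoning_first


def stabilization_step(predictions: Sequence[str]) -> int:
    """Index of the first diffusion step from which a position's predicted
    token equals its final value and never changes again.

    Assumption: `predictions[t]` is the token predicted at one position
    after step t. The paper may define stabilization differently.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    final = predictions[-1]
    step = len(predictions) - 1
    # Walk backwards while the earlier step already agrees with the final token.
    while step > 0 and predictions[step - 1] == final:
        step -= 1
    return step


# Hypothetical accuracies: 0.80 with reasoning first, 0.40 with answer first.
drop = relative_drop(0.80, 0.40)

# Hypothetical per-position traces over five diffusion steps.
reasoning_trace = ["[MASK]", "add", "add", "add", "add"]
answer_trace = ["[MASK]", "[MASK]", "12", "15", "15"]
reasoning_step = stabilization_step(reasoning_trace)
answer_step = stabilization_step(answer_trace)
```

In this toy trace the reasoning token settles before the answer token, which is the pattern the abstract describes for MDLMs. The relative drop is measured against the reasoning-first baseline, so a fall from 0.80 to 0.40 counts as a 50% relative drop even though the absolute gap is 40 points.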