CR · AI · CL · Aug 17, 2025

Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position

arXiv:2508.12398v1 · 17.69 citations · h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses safety alignment for a novel non-autoregressive architecture, an incremental but important step toward secure AI deployment.

The paper tackles the lack of safety studies on diffusion large language models (dLLMs) by analyzing their safety performance and proposing a novel alignment method. The method achieves superior security against attacks while maintaining utility on tasks such as coding and math.

Diffusion Large Language Models (dLLMs) have recently emerged as a competitive non-autoregressive paradigm due to their unique training and inference approach. However, safety studies on this novel architecture are currently lacking. In this paper, we present the first analysis of dLLMs' safety performance and propose a novel safety alignment method tailored to their unique generation characteristics. Specifically, we identify a critical asymmetry between the defender and the attacker in terms of security. For the defender, we reveal that the middle tokens of the response, rather than the initial ones, are more critical to the overall safety of dLLM outputs, suggesting that aligning middle tokens may be more beneficial to the defender. The attacker, by contrast, may have limited power to manipulate middle tokens: we find that dLLMs have a strong tendency toward a sequential generation order in practice, which forces an attack to conform to this distribution and diverts it from influencing the critical middle tokens. Building on this asymmetry, we introduce Middle-tOken Safety Alignment (MOSA), a novel method that uses reinforcement learning to directly align the model's middle generation with safe refusals. We implement MOSA and compare its security performance against eight attack methods on two benchmarks. We also test the utility of the MOSA-aligned dLLM on coding, math, and general reasoning. The results demonstrate the superiority of MOSA.
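The claim that dLLMs tend toward a sequential generation order can be made concrete with a toy model of confidence-based unmasking, a common decoding strategy for masked diffusion LMs: start from a fully masked response and, at each step, commit the masked positions the model is most confident about. The sketch below is not the paper's code. The `predict` callable stands in for a real dLLM, the `left_biased` predictor is hypothetical, and the Kendall-tau-style `sequentiality` score is only one plausible way to quantify how close an unmasking order is to left-to-right.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: given the partially decoded sequence and a masked
# position, return (predicted_token, confidence). A real dLLM would score all
# masked positions in one forward pass; this callable stands in for that model.
Predictor = Callable[[List[Optional[int]], int], Tuple[int, float]]


def confidence_unmask(length: int, predict: Predictor, tokens_per_step: int = 1) -> List[int]:
    """Start fully masked; at each step commit the most confident masked positions.

    Returns order[i] = the decoding step at which position i was unmasked.
    """
    seq: List[Optional[int]] = [None] * length  # None marks a masked position
    order = [-1] * length
    step = 0
    while any(tok is None for tok in seq):
        masked = [i for i, tok in enumerate(seq) if tok is None]
        # Score every still-masked position (before committing any), highest confidence first.
        scored = sorted(
            ((predict(seq, i), i) for i in masked),
            key=lambda item: item[0][1],
            reverse=True,
        )
        for (tok, _conf), i in scored[:tokens_per_step]:
            seq[i] = tok
            order[i] = step
        step += 1
    return order


def sequentiality(order: List[int]) -> float:
    """Kendall-tau-style score: +1.0 for strict left-to-right decoding,
    -1.0 for strict right-to-left; pairs decoded in the same step are ignored."""
    concordant = discordant = 0
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if order[i] < order[j]:
                concordant += 1
            elif order[i] > order[j]:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def left_biased(seq: List[Optional[int]], i: int) -> Tuple[int, float]:
    """Hypothetical toy predictor: confidence decays with distance from the leftmost mask."""
    first_masked = next(k for k, tok in enumerate(seq) if tok is None)
    return 0, 1.0 / (1 + i - first_masked)


if __name__ == "__main__":
    order = confidence_unmask(8, left_biased)
    print(order, sequentiality(order))
```

Under this toy metric, the observed tendency would correspond to real dLLM unmasking orders scoring close to +1.0; the abstract does not state which measurement the authors actually use.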

