CL AI Mar 17

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

arXiv:2604.08557 72.5

AI Analysis

This work exposes a critical security flaw in safety-aligned diffusion-based language models, showing that their alignment is architecturally shallow rather than adversarially robust.

The paper demonstrates that safety-aligned diffusion language models are vulnerable to a simple attack that re-masks refusal tokens and injects an affirmative prefix, achieving 76.1% attack success rate on HarmBench against LLaDA-8B-Instruct and 81.8% against Dream-7B-Instruct, revealing that their safety relies on a fragile architectural assumption.

Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.
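
The assumption named in the abstract, that committed tokens are never re-evaluated, and the proposed post-commitment re-verification defense can be illustrated with a toy sketch. This is not the paper's implementation and involves no real model: `monotonic_denoise`, `verify_commitments`, the `propose` callback, and the `<mask>` placeholder are all hypothetical names introduced here, and the random unmasking order is an assumption standing in for whatever schedule a real dLLM uses.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a masked position


def monotonic_denoise(length, steps, propose, seed=0):
    """Toy monotonic unmasking loop (illustrative only, no real model).

    Every step commits a few still-masked positions. A committed position
    is never revisited, which is the assumption the abstract describes.
    Returns the final sequence and the step at which each position committed.
    """
    rng = random.Random(seed)
    seq = [MASK] * length
    commit_step = [None] * length
    per_step = -(-length // steps)  # ceiling division
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        for i in rng.sample(masked, min(per_step, len(masked))):
            seq[i] = propose(i, seq)  # stand-in for the model's prediction
            commit_step[i] = step
    return seq, commit_step


def verify_commitments(snapshot, current):
    """Return positions that were committed in `snapshot` but differ in `current`.

    A non-empty result means a committed token was re-masked or replaced,
    i.e. the monotonic schedule was violated between the two states.
    """
    return [
        i
        for i, (old, new) in enumerate(zip(snapshot, current))
        if old != MASK and old != new
    ]


if __name__ == "__main__":
    seq, commit_step = monotonic_denoise(8, 4, lambda i, s: f"tok{i}")
    tampered = list(seq)
    tampered[0] = MASK  # simulate an out-of-schedule change to position 0
    print(verify_commitments(seq, seq))
    print(verify_commitments(seq, tampered))
```

Comparing a snapshot against the current state is only one possible reading of "re-verification"; a real defense would presumably also re-score the committed content, which this sketch does not attempt.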
