LG AI Jun 5

Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

arXiv:2602.02600
22.72 citations
Originality: Incremental advance
AI Analysis

For safety researchers, this work provides a mechanistic understanding of refusal dynamics in language models and a practical, lightweight jailbreak detection method.

Diffusion language models exhibit improved jailbreak robustness over autoregressive models due to their remasking sampling mechanism, which enables recovery from harmful intermediate generations. The proposed Step-Wise Refusal Internal Dynamics (SRI) signal enables a jailbreak detector that matches or outperforms existing baselines with negligible overhead.

Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decoding, competitive generation quality, and initial evidence of improved jailbreak robustness. Despite this progress, the role of sampling mechanisms in shaping refusal behavior remains poorly understood. To address this gap, we present a comprehensive study of step-wise refusal dynamics. We show that diffusion remasking can promote recovery from harmful intermediate generations, provide evidence that this behavior is tied to the sampling mechanism, and demonstrate that switching from AR to diffusion sampling improves jailbreak robustness, including under fixed model weights. To capture generation dynamics not observable at the text level, we propose the Step-Wise Refusal Internal Dynamics (SRI) signal. Consistent with our text-level findings, SRI shows that recovery fails primarily under AR sampling, with these failures often appearing anomalous relative to harmless generations in the SRI space. Based on this observation, we show that SRI enables a simple jailbreak detector that does not modify inference and generalizes to unseen attacks by training only on benign SRI signals. Our evaluation shows that this detector matches or outperforms existing jailbreak detection baselines while adding negligible overhead.
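The abstract describes a detector trained only on benign SRI signals that flags anomalous generations. The following is a minimal sketch of that general idea (one-class anomaly detection), not the authors' method: the abstract does not specify how SRI is computed or which detector is used. Here SRI signals are assumed to be fixed-length feature vectors, the data is synthetic placeholder data, and the function names (`fit_benign_profile`, `anomaly_score`, `fit_threshold`) and the Mahalanobis-distance scoring are illustrative assumptions.

```python
import numpy as np


def fit_benign_profile(benign, ridge=1e-3):
    """Estimate mean and inverse covariance from benign feature vectors only."""
    mean = benign.mean(axis=0)
    # A small ridge term keeps the covariance matrix invertible.
    cov = np.cov(benign, rowvar=False) + ridge * np.eye(benign.shape[1])
    return mean, np.linalg.inv(cov)


def anomaly_score(x, mean, cov_inv):
    """Squared Mahalanobis distance of each row of x from the benign mean."""
    diff = np.atleast_2d(x) - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def fit_threshold(benign, mean, cov_inv, quantile=0.99):
    """Pick the score threshold as a high quantile of benign scores."""
    return float(np.quantile(anomaly_score(benign, mean, cov_inv), quantile))


# Synthetic stand-ins for SRI feature vectors (assumption: 8-dimensional).
rng = np.random.default_rng(0)
benign_train = rng.normal(0.0, 1.0, size=(500, 8))

mean, cov_inv = fit_benign_profile(benign_train)
threshold = fit_threshold(benign_train, mean, cov_inv)

# Synthetic "anomalous" vectors, shifted far from the benign distribution.
shifted = rng.normal(4.0, 1.0, size=(5, 8))
flags = anomaly_score(shifted, mean, cov_inv) > threshold
benign_flag_rate = float((anomaly_score(benign_train, mean, cov_inv) > threshold).mean())
```

Because the threshold is fit on benign scores alone, no attack examples are needed at training time, which is the property the abstract attributes to the SRI-based detector. The quantile controls the false-positive rate on benign inputs.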

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
