Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

arXiv:2605.02958

Predicted impact: top 31% in CR · last 90 days
Originality: Highly original

AI Analysis

For LLM safety practitioners, this provides a robust detection method against sophisticated jailbreak attacks that evade existing defenses.

The paper reveals that refusal in LLMs is a dynamic, sparse process with a persistent upstream signature (Refusal Trajectory) that survives adversarial suppression of terminal signals. The proposed detector SALO improves jailbreak detection rates from ~0% to >90% against forced-decoding attacks.

Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
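The contrast the abstract draws, between a static terminal-state check and a signal spread across upstream layers, can be illustrated with a toy sketch. This is not the authors' SALO implementation and does not use Causal Tracing. It assumes synthetic activations, a difference-of-means direction per layer, and a simple top-k aggregation. The function names (`refusal_direction`, `terminal_score`, `trajectory_score`) and all parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def refusal_direction(harmful, harmless):
    """Unit difference-of-means direction between two activation sets of shape (n, d)."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def terminal_score(acts, direction):
    """Projection of the last layer's activation onto one static direction.
    acts has shape (n_layers, d)."""
    return float(acts[-1] @ direction)

def trajectory_score(acts, directions, k=3):
    """Mean of the k largest per-layer projections (a toy 'sparse' aggregate).
    acts and directions both have shape (n_layers, d)."""
    per_layer = np.einsum("ld,ld->l", acts, directions)
    return float(np.sort(per_layer)[-k:].mean())

# Synthetic data (an assumption, not real model activations):
# harmful prompts carry a shift on feature 0 at every layer.
rng = np.random.default_rng(0)
n, n_layers, d = 200, 6, 8
harmless = rng.normal(0.0, 0.1, size=(n, n_layers, d))
harmful = rng.normal(0.0, 0.1, size=(n, n_layers, d))
harmful[:, :, 0] += 1.0

directions = np.stack(
    [refusal_direction(harmful[:, layer], harmless[:, layer]) for layer in range(n_layers)]
)

# Toy "attack": remove the refusal component from the terminal layer only.
attacked = harmful[0].copy()
attacked[-1] -= (attacked[-1] @ directions[-1]) * directions[-1]

print("terminal score (attacked):  ", terminal_score(attacked, directions[-1]))
print("trajectory score (attacked):", trajectory_score(attacked, directions))
print("trajectory score (harmless):", trajectory_score(harmless[0], directions))
```

In this toy setup the terminal projection of the attacked example collapses to roughly zero while the upstream aggregate stays well separated from the harmless example, which mirrors the qualitative claim above. The choice of `k` and the top-k rule are placeholders; the paper's actual localization operator may differ substantially.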
