Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

arXiv:2605.02958
AI Analysis

For LLM safety practitioners, this paper provides a robust method for detecting sophisticated jailbreak attacks that evade existing defenses.

The paper reveals that refusal in LLMs is a dynamic, sparse process with a persistent upstream signature (Refusal Trajectory) that survives adversarial suppression of terminal signals. The proposed detector SALO improves jailbreak detection rates from ~0% to >90% against forced-decoding attacks.

Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
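
The contrast between a terminal-state detector and an upstream one can be illustrated with a toy sketch. This is not SALO and does not use Causal Tracing: it is a plain difference-of-means probe on synthetic activations, and every name and constant in it (the layer indices, `SHIFT`, the `sample` helper) is an assumption made for illustration. It only shows why a detector that reads an upstream layer can keep working when an attack wipes out the signal at the terminal layer; the rates it prints are properties of the synthetic data and say nothing about the paper's results.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions, not from the paper).
rng = np.random.default_rng(0)
N, LAYERS, DIM = 500, 6, 16
UPSTREAM, TERMINAL = 2, LAYERS - 1   # hypothetical layer indices
SHIFT = 6.0                          # strength of the synthetic refusal signal


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


# One hidden "refusal" direction per layer that carries signal.
true_dirs = {
    UPSTREAM: unit(rng.normal(size=DIM)),
    TERMINAL: unit(rng.normal(size=DIM)),
}


def sample(n, harmful, suppress_terminal=False):
    """Draw synthetic activations of shape (n, LAYERS, DIM).

    Harmful prompts get a shift at the upstream and terminal layers.
    suppress_terminal=True mimics an attack that removes only the
    terminal-layer shift.
    """
    acts = rng.normal(size=(n, LAYERS, DIM))
    if harmful:
        acts[:, UPSTREAM] += SHIFT * true_dirs[UPSTREAM]
        if not suppress_terminal:
            acts[:, TERMINAL] += SHIFT * true_dirs[TERMINAL]
    return acts


def fit_detector(harmful, harmless, layer):
    """Difference-of-means direction plus a midpoint threshold for one layer."""
    d = unit(harmful[:, layer].mean(axis=0) - harmless[:, layer].mean(axis=0))
    thr = 0.5 * ((harmful[:, layer] @ d).mean() + (harmless[:, layer] @ d).mean())
    return d, thr


def detection_rate(acts, layer, detector):
    """Fraction of inputs whose projection exceeds the detector threshold."""
    d, thr = detector
    return float(((acts[:, layer] @ d) > thr).mean())


# Fit both detectors on clean (unattacked) data.
train_harmful, train_benign = sample(N, True), sample(N, False)
terminal_det = fit_detector(train_harmful, train_benign, TERMINAL)
upstream_det = fit_detector(train_harmful, train_benign, UPSTREAM)

# Evaluate on harmful inputs whose terminal signal has been suppressed.
attacked = sample(N, True, suppress_terminal=True)
terminal_rate = detection_rate(attacked, TERMINAL, terminal_det)
upstream_rate = detection_rate(attacked, UPSTREAM, upstream_det)

print(f"terminal-layer detector on attacked inputs: {terminal_rate:.2%}")
print(f"upstream-layer detector on attacked inputs: {upstream_rate:.2%}")
```

In this toy the terminal detector misses nearly every attacked input because the feature it was fit on is gone, while the upstream detector is unaffected. With a real model the activations would come from the network's hidden states rather than from `sample`, and the choice of which layers and units to read is exactly what the paper's method is meant to determine.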
