Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
For LLM safety practitioners, this work provides a detection method that remains robust against sophisticated jailbreak attacks that evade existing defenses.
The paper reveals that refusal in LLMs is a dynamic, sparse process with a persistent upstream signature (Refusal Trajectory) that survives adversarial suppression of terminal signals. The proposed detector SALO improves jailbreak detection rates from ~0% to >90% against forced-decoding attacks.
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
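To make the baseline concrete, the following is a minimal toy sketch of a static refusal vector detector of the kind the abstract contrasts against. It is not SALO and not the paper's method. It assumes a difference-of-means construction for the refusal direction, which is one common choice in Representation Engineering. The 3-dimensional activations, the midpoint threshold, and the `suppress` helper (a projection ablation standing in for an attack that removes the terminal signal) are all hypothetical and chosen only for illustration.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_direction(harmful, harmless):
    """Assumed construction: normalized difference of class means."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def projection(x, direction):
    """Scalar projection of an activation onto the refusal direction."""
    return sum(a * b for a, b in zip(x, direction))

def suppress(x, direction):
    """Hypothetical stand-in for an attack: remove the component of x
    along the refusal direction (projection ablation)."""
    p = projection(x, direction)
    return [a - p * d for a, d in zip(x, direction)]

# Toy terminal-layer activations (made-up numbers, not real model data).
harmful_terminal = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
harmless_terminal = [[0.0, 0.1, 0.1], [0.2, -0.1, 0.1]]

direction = refusal_direction(harmful_terminal, harmless_terminal)

# Assumed decision rule: threshold at the midpoint of the projected class means.
threshold = 0.5 * (
    projection(mean_vector(harmful_terminal), direction)
    + projection(mean_vector(harmless_terminal), direction)
)

harmful_prompt = [1.9, 0.0, 0.1]
attacked_prompt = suppress(harmful_prompt, direction)

clean_score = projection(harmful_prompt, direction)
attacked_score = projection(attacked_prompt, direction)

flagged_clean = clean_score > threshold
flagged_attacked = attacked_score > threshold
```

In this toy setting the unmodified harmful prompt is flagged, while the same prompt after suppression of the terminal component is not. That is the failure mode the abstract attributes to detectors that read only terminal states, and the motivation for looking at upstream signals instead.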