CR · AI · May 11

Re-Triggering Safeguards within LLMs for Jailbreak Detection

arXiv:2605.10611 · 17.1
Predicted impact: top 12% in CR · last 90 days
Originality: Incremental advance
AI Analysis

For LLM security, this method offers a cooperative defense: it detects jailbreak prompts by leveraging the model's existing safeguards rather than adding a standalone detector.

The paper proposes a jailbreak detection method that re-activates built-in safeguards in LLMs via embedding disruption, achieving effective defense against state-of-the-art attacks in both white-box and black-box settings while remaining robust to adaptive attacks.

This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approach instead cooperates with the LLM's internal defense mechanisms by re-triggering them. Moreover, through extensive analysis, we gain a comprehensive understanding of the disruption effects and develop an efficient search algorithm to identify appropriate disruptions for effective jailbreak detection. Extensive experiments demonstrate that our approach effectively defends against state-of-the-art jailbreak attacks in white-box and black-box settings, and remains robust even against adaptive attacks.
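The abstract does not spell out the disruption or the search algorithm, so the following is only a toy sketch of the general idea under loudly stated assumptions: a prompt that slips past a safeguard but starts being refused once its embedding is slightly perturbed gets flagged as a likely jailbreak. Everything here is hypothetical and not taken from the paper: the stand-in safeguard `toy_safeguard_refuses`, the axis-aligned disruptions, the `scale` and `threshold` values, and the example embeddings.

```python
from typing import Callable, List, Sequence

Embedding = List[float]


def toy_safeguard_refuses(embedding: Sequence[float]) -> bool:
    """Hypothetical stand-in for an LLM's built-in safeguard.

    Refuses when a made-up "harm score" (the coordinate sum) exceeds 1.0.
    A real system would query the model and check for a refusal instead.
    """
    return sum(embedding) > 1.0


def candidate_disruptions(dim: int, scale: float) -> List[Embedding]:
    """Enumerate simple disruptions: +scale or -scale on a single coordinate."""
    disruptions = []
    for i in range(dim):
        for sign in (1.0, -1.0):
            delta = [0.0] * dim
            delta[i] = sign * scale
            disruptions.append(delta)
    return disruptions


def refusal_rate(
    embedding: Sequence[float],
    refuses: Callable[[Sequence[float]], bool],
    scale: float = 0.1,
) -> float:
    """Fraction of disrupted embeddings that the safeguard refuses."""
    disruptions = candidate_disruptions(len(embedding), scale)
    hits = 0
    for delta in disruptions:
        disrupted = [x + d for x, d in zip(embedding, delta)]
        if refuses(disrupted):
            hits += 1
    return hits / len(disruptions)


def classify(
    embedding: Sequence[float],
    refuses: Callable[[Sequence[float]], bool] = toy_safeguard_refuses,
    scale: float = 0.1,
    threshold: float = 0.25,  # assumed cut-off, not from the paper
) -> str:
    """Return "refused", "jailbreak", or "benign" for a prompt embedding."""
    if refuses(embedding):
        return "refused"  # the safeguard fired without any disruption
    if refusal_rate(embedding, refuses, scale) >= threshold:
        return "jailbreak"  # small disruptions re-trigger the safeguard
    return "benign"


# Made-up embeddings: one far from the refusal boundary, one just below it.
benign = [0.1, 0.0, 0.05, 0.0]
jailbreak = [0.5, 0.25, 0.2, 0.0]

print(classify(benign))
print(classify(jailbreak))
```

In this toy setup the fragile prompt sits just under the refusal boundary, so half of the small disruptions push it over, while the benign prompt is never refused. A real implementation would perturb the model's actual input embeddings and would need the paper's search procedure to choose disruptions that do not also trigger refusals on benign prompts.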
