CL | Dec 22, 2024

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

arXiv:2412.17034v2 | 22 citations | h-index: 47
Originality: Incremental advance
AI Analysis

This work addresses a major security concern for LLM users by providing an effective defense against jailbreak attacks, though the advance is incremental because it builds on an existing understanding of safety boundaries.

The paper tackles jailbreaking in Large Language Models by analyzing seven jailbreak methods and proposing a novel defense called Activation Boundary Defense (ABD), which achieves an average Defense Success Rate (DSR) of over 98% against attacks with less than 2% impact on general capabilities.
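To make the headline number concrete, here is a minimal sketch of how an average DSR could be computed. It reads DSR as the share of attack attempts that fail to elicit harmful output. The attack names, the outcome lists, and the helper `defense_success_rate` are all hypothetical, and the page does not describe how the paper judges whether an attack was blocked.

```python
def defense_success_rate(defended):
    """Return the percentage of attack attempts that were blocked.

    `defended` is a list of booleans, one per attempt (True = blocked).
    """
    if not defended:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(defended) / len(defended)


# Hypothetical outcomes for two made-up attack methods (not data from the paper).
outcomes = {
    "attack_a": [True] * 99 + [False],
    "attack_b": [True] * 97 + [False] * 3,
}

# DSR per attack method, then the unweighted mean across methods.
per_attack = {name: defense_success_rate(v) for name, v in outcomes.items()}
average_dsr = sum(per_attack.values()) / len(per_attack)
```

Averaging per-method rates, as done here, weights each attack method equally regardless of how many prompts it contributed. Whether the paper averages this way is an assumption.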

Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that disagreements among existing explanations stem from insufficient observation samples. In particular, we introduce the safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense method to the low and middle layers. Our experiments on several benchmarks show that ABD achieves an average DSR of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.
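The idea of constraining activations within a boundary can be illustrated with a toy sketch. This is not the paper's ABD implementation: the abstract does not say how the safety boundary is estimated or what constraint function is used. The sketch assumes a per-dimension box of mean plus or minus k standard deviations, fitted on synthetic reference activations, with plain clipping as the constraint. The names `fit_boundary` and `constrain` and the parameter `k` are hypothetical.

```python
import numpy as np


def fit_boundary(reference_acts, k=3.0):
    """Estimate a per-dimension box [mean - k*std, mean + k*std].

    Assumption: the boundary is an axis-aligned box; the paper may define it differently.
    """
    mean = reference_acts.mean(axis=0)
    std = reference_acts.std(axis=0)
    return mean - k * std, mean + k * std


def constrain(acts, lower, upper):
    """Pull any activation outside the box back onto its nearest face."""
    return np.clip(acts, lower, upper)


rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations of harmful prompts at one layer.
harmful_acts = rng.normal(0.0, 1.0, size=(500, 8))
lower, upper = fit_boundary(harmful_acts)

# Simulate a jailbreak as a large shift that pushes activations outside the box.
shifted = harmful_acts[:4] + 10.0
constrained = constrain(shifted, lower, upper)
```

In a real model this step would run inside forward hooks on the selected layers. The abstract says those layers are chosen among the low and middle layers with Bayesian optimization, which this sketch leaves out.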

