CL · Dec 22, 2024

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, Xiuying Chen

arXiv:2412.17034v2 13.522 citations h-index: 47

Originality: Incremental advance

AI Analysis

The paper addresses a major security concern for LLM users by providing an effective defense against jailbreak attacks, though the advance is incremental because it builds on an existing understanding of safety boundaries.

The paper tackles jailbreaking in Large Language Models by analyzing seven jailbreak methods and proposing a novel defense called Activation Boundary Defense (ABD), which achieves an average defense success rate (DSR) of over 98% against attacks with less than 2% impact on general capabilities.

Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs into generating harmful text. Yet, there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that disagreements in prior work stem from insufficient observation samples. In particular, we introduce the notion of a safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and the middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense method to the low and the middle layers. Our experiments on several benchmarks show that ABD achieves an average DSR of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.
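The idea of constraining activations within a boundary can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the safety boundary is represented, so the per-dimension box (mean plus or minus k standard deviations), the parameter k, and the function names fit_boundary and constrain are all assumptions made for illustration. The abstract also describes the defense as adaptive and applied to layers selected by Bayesian optimization; both are omitted here.

```python
from statistics import mean, pstdev

def fit_boundary(reference_activations, k=3.0):
    """Estimate a per-dimension [low, high] box from reference activation vectors.

    Assumption: the boundary is modeled as mean +/- k * std per dimension.
    The paper's actual boundary definition may differ.
    """
    dims = zip(*reference_activations)  # transpose: one tuple of values per dimension
    bounds = []
    for values in dims:
        mu, sigma = mean(values), pstdev(values)
        bounds.append((mu - k * sigma, mu + k * sigma))
    return bounds

def constrain(activation, bounds):
    """Clamp each coordinate of one activation vector into its [low, high] interval."""
    return [min(max(x, low), high) for x, (low, high) in zip(activation, bounds)]

# Toy 2-dimensional "activations" standing in for hidden states at one layer.
reference = [[0.0, 1.0], [2.0, 1.0], [0.0, 3.0], [2.0, 3.0]]
bounds = fit_boundary(reference, k=1.0)

# A vector shifted outside the box along the first dimension is pulled back in.
shifted = [5.0, 2.0]
constrained = constrain(shifted, bounds)
```

In this toy setup, a vector that already lies inside the box is returned unchanged, which loosely mirrors the stated goal of leaving general capabilities largely unaffected.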
