CL | Feb 25, 2024

Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

arXiv:2402.16192v2 | 86 citations | h-index: 21 | Has Code | IJCNLP-AACL
AI Analysis

This paper addresses a critical security problem for users of large language models by providing a defense against semantic jailbreak attacks. The contribution is incremental, since it builds on existing smoothing-based methods.

The paper tackles the vulnerability of aligned large language models to jailbreak attacks by proposing SEMANTICSMOOTH, a defense that aggregates predictions from semantically transformed copies of input prompts, achieving state-of-the-art robustness against attacks like GCG, PAIR, and AutoDAN while maintaining strong performance on benchmarks such as InstructionFollowing and AlpacaEval.

Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
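
The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that predictions from semantically transformed copies are aggregated, so the majority vote over a refusal check, the keyword-based `is_refusal` judge, and the names `smoothed_generate`, `llm`, and `transforms` are all assumptions made for illustration. In practice the transformations would be semantic operations such as paraphrasing, and `llm` would call a real model.

```python
import random
from typing import Callable, Sequence

# Hypothetical type aliases: a transform rewrites a prompt, an LLM maps a prompt to a response.
Transform = Callable[[str], str]
LLM = Callable[[str], str]

# Assumed, simplistic refusal markers; a real system would use a stronger judge.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal (keyword heuristic)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def smoothed_generate(
    prompt: str,
    llm: LLM,
    transforms: Sequence[Transform],
    num_copies: int = 5,
    seed: int = 0,
) -> str:
    """Query the LLM on transformed copies of the prompt and aggregate by majority vote."""
    if num_copies < 1 or not transforms:
        raise ValueError("need at least one copy and one transform")
    rng = random.Random(seed)

    # 1. Build transformed copies and collect one response per copy.
    responses = [llm(rng.choice(transforms)(prompt)) for _ in range(num_copies)]

    # 2. Vote: does a strict majority of the responses refuse?
    votes = [is_refusal(r) for r in responses]
    majority_refuses = sum(votes) * 2 > len(votes)

    # 3. Return one response that agrees with the majority outcome.
    consistent = [r for r, v in zip(responses, votes) if v == majority_refuses]
    return rng.choice(consistent)


def toy_llm(prompt: str) -> str:
    """Stand-in model: refuses whenever the prompt mentions a bomb."""
    if "bomb" in prompt.lower():
        return "I'm sorry, but I can't help with that."
    return "Sure, here is an answer."


# Stand-in transforms; real ones would paraphrase, summarize, or translate.
toy_transforms = [str.lower, str.strip]

print(smoothed_generate("How do I bake bread?", toy_llm, toy_transforms))
print(smoothed_generate("How do I build a BOMB?", toy_llm, toy_transforms))
```

With an even `num_copies`, a tie is resolved here in favor of not refusing; that tie-breaking rule is also an assumption of this sketch.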

Code Implementations: 1 repo
