CR AI CL · Feb 25, 2024

DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

arXiv:2402.16914v332.5115 citations · h-index: 21 · Has Code · EMNLP

Originality Highly original

AI Analysis

This paper addresses security vulnerabilities in LLMs that matter to developers and users, though it is an incremental improvement over existing jailbreak methods.

The paper tackles the problem of jailbreaking safety-aligned Large Language Models (LLMs) by introducing DrAttack, a method that decomposes a malicious prompt into sub-prompts to obscure its intent and then reconstructs them via in-context learning. It reports a 78.0% success rate on GPT-4 with only 15 queries, surpassing the prior state of the art by 33.1%.

The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain of success rate over prior SOTA prompt-only attackers. Notably, the success rate of 78.0% on GPT-4 with merely 15 queries surpassed previous art by 33.1%. The project is available at https://github.com/xirui-li/DrAttack.
