CL · AI · CY · Jan 18, 2024

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

arXiv:2401.09798v3 · 40 citations · Appl Sci
Originality: Highly original
AI Analysis

This work highlights a heightened risk for AI safety by showing that effective jailbreak attacks can be simpler than previously thought, posing a threat to users relying on LLM safeguards.

The paper addresses the jailbreaking of large language models (LLMs) with a simple black-box method that uses the target model itself to rewrite harmful prompts into benign-looking expressions, achieving an attack success rate above 80% within an average of five iterations on ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro.

Large Language Models (LLMs), such as ChatGPT, encounter 'jailbreak' challenges, wherein safeguards are circumvented to generate ethically harmful prompts. This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts, addressing the significant complexity and computational costs associated with conventional methods. Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM, predicated on the hypothesis that LLMs can autonomously generate expressions that evade safeguards. Through experiments conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions and proved robust against model updates. The jailbreak prompts generated were not only naturally-worded and succinct but also challenging to defend against. These findings suggest that the creation of effective jailbreak prompts is less complex than previously believed, underscoring the heightened risk posed by black-box jailbreak attacks.

Code Implementations: 1 repo