CL · AI · LG · Dec 4, 2024

Best-of-N Jailbreaking

arXiv:2412.03556v2 · 59 citations · h-index: 39 · Has Code
AI Analysis

This work exposes a security vulnerability that matters to both developers and users of AI systems: even state-of-the-art models are susceptible to simple attacks. The contribution is incremental, since it builds on existing black-box jailbreaking techniques.

The paper studies jailbreaking of frontier AI systems across modalities and introduces Best-of-N Jailbreaking, a simple black-box algorithm. When sampling 10,000 augmented prompts, it reaches attack success rates of 89% on GPT-4o and 78% on Claude 3.5 Sonnet, and it extends effectively to vision and audio language models.

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
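The scaling claim above (ASR rising with the number of samples N in a power-law-like way) can be made concrete with a small fitting sketch. The abstract does not state the exact functional form, so the form used here, -log ASR(N) ≈ a · N^(-b), is an assumption chosen for illustration, as are the helper names `fit_power_law` and `predict_asr`. The data points are synthetic, generated from known parameters, and are not results from the paper.

```python
import math


def predict_asr(n, a, b):
    """ASR under the ASSUMED form -log(ASR) = a * n**(-b)."""
    return math.exp(-a * n ** (-b))


def fit_power_law(ns, asrs):
    """Ordinary least squares on log(-log ASR) = log(a) - b * log(N).

    Requires every ASR to lie strictly between 0 and 1.
    Returns the estimated (a, b).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(asr)) for asr in asrs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic illustration only: hypothetical parameters, not measured values.
TRUE_A, TRUE_B = 5.0, 0.4
ns = [1, 10, 100, 1000]
asrs = [predict_asr(n, TRUE_A, TRUE_B) for n in ns]

a_hat, b_hat = fit_power_law(ns, asrs)
# Extrapolate the fitted curve to a larger sample budget.
forecast = predict_asr(10_000, a_hat, b_hat)
```

Fitting in log-log space turns the assumed power law into a straight line, which is why a plain linear regression suffices. With real measurements the points would be noisy, and an extrapolation such as `forecast` should be read as a rough projection rather than a guarantee.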

Code Implementations: 1 repo
