CL · AI · LG · Nov 6, 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

ETH Zurich
arXiv:2311.03348v2 · 221 citations · h-index: 17
AI Analysis

This work exposes a critical security flaw in commercial AI systems that could be misused to generate harmful content. The contribution is incremental, since it builds on existing jailbreak techniques.

The paper studies the vulnerability of large language models to jailbreak prompts, introducing persona modulation as a black-box method that steers a model into harmful behaviors. The automated attacks achieve a harmful completion rate of 42.5% in GPT-4, 185 times the rate before modulation (0.23%).

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
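The "185 times" figure follows directly from the two GPT-4 rates quoted in the abstract. As a quick arithmetic check (the variable and function names below are illustrative, not taken from the paper), the ratio can be reproduced as follows:

```python
# Harmful completion rates for GPT-4 as reported in the abstract, in percent.
baseline_rate = 0.23    # before persona modulation
modulated_rate = 42.5   # after persona modulation


def fold_increase(after: float, before: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


ratio = fold_increase(modulated_rate, baseline_rate)
print(f"{ratio:.1f}x")  # prints "184.8x", which rounds to the reported 185
```

The small gap between 184.8 and 185 comes from rounding; the reported rates are themselves rounded values.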

