CL · May 3

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, Jianfeng Gao

arXiv:2605.01687 · 86.41 citations

Predicted impact: top 46% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For LLM safety researchers, MultiBreak provides a diverse and scalable benchmark to evaluate and improve model robustness against realistic multi-turn attacks, addressing a critical gap in existing safety evaluations.

MultiBreak is a scalable multi-turn jailbreak benchmark with 10,389 adversarial prompts across 2,665 harmful intents, achieving up to 54.0% and 34.6% higher attack success rate than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively, revealing fine-grained LLM vulnerabilities in multi-turn settings.

We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark for evaluating large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, which lets them bypass safety-aligned LLMs more easily than single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restricts their diversity. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, in which a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to 54.0% and 34.6% higher attack success rate (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and that categories that appear benign in single-turn settings can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.
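The ASR comparison in the abstract can be illustrated with a minimal sketch. It assumes ASR is the share of multi-turn prompts that a judge marks as successful, reported in percent, and that the gap between two datasets is taken as an absolute difference; the abstract does not state either convention. The function names, category labels, and outcome values below are hypothetical toy inputs, not results from the paper.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return ASR in percent: the share of attempts judged successful.

    `outcomes` holds one boolean per multi-turn prompt (assumed convention).
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def asr_by_category(records):
    """Group (category, success) pairs and compute ASR for each category."""
    grouped = defaultdict(list)
    for category, success in records:
        grouped[category].append(success)
    return {cat: attack_success_rate(vals) for cat, vals in grouped.items()}


def asr_gap(outcomes_a, outcomes_b):
    """Absolute ASR difference in percentage points (assumed definition)."""
    return attack_success_rate(outcomes_a) - attack_success_rate(outcomes_b)


# Hypothetical judge verdicts, for illustration only.
toy_records = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", True),
    ("category_b", True),
]
per_category = asr_by_category(toy_records)
gap = asr_gap([True, True, True, False], [True, False, False, False])
```

Reporting ASR per category rather than as a single aggregate is what allows the fine-grained comparison the abstract describes, since a category with a low single-turn ASR can be checked separately against its multi-turn ASR.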
