LG, CL, CR, CY | Aug 27, 2024

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

arXiv:2408.15221v2 | 146 citations | h-index: 11
Originality: Incremental advance
AI Analysis

The paper exposes a critical security gap for AI safety practitioners: current defenses fail against realistic multi-turn attacks. The finding is incremental but important.

The paper shows that LLM defenses are vulnerable to multi-turn human jailbreaks, which exceed a 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs under automated single-turn attacks. It also reveals weaknesses in machine unlearning defenses by recovering dual-use biosecurity knowledge from unlearned models.

Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
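As a rough illustration of the ASR metric quoted above, the sketch below computes an attack success rate from judged attempts. It assumes, without confirmation from the abstract, that a behavior counts as broken if at least one attempt against it was judged successful, and that ASR is the fraction of behaviors broken. The function name, the `(behavior_id, succeeded)` record shape, and the example data are all hypothetical; the paper's actual evaluation pipeline may differ.

```python
from collections import defaultdict


def attack_success_rate(attempts):
    """Fraction of behaviors with at least one successful attempt.

    `attempts` is an iterable of (behavior_id, succeeded) pairs, where
    `succeeded` is the judged outcome of one attempt. This record shape
    is an assumption made for illustration only.
    """
    broken = defaultdict(bool)  # behavior_id -> any attempt succeeded?
    for behavior_id, succeeded in attempts:
        broken[behavior_id] = broken[behavior_id] or bool(succeeded)
    if not broken:
        raise ValueError("no attempts given")
    return sum(broken.values()) / len(broken)


# Hypothetical judged outcomes: three behaviors, four attempts in total.
example_attempts = [
    ("b1", False),
    ("b1", True),   # b1 is broken on the second attempt
    ("b2", False),  # b2 is never broken
    ("b3", True),
]
asr = attack_success_rate(example_attempts)
print(f"ASR: {asr:.1%}")
```

Aggregating per behavior rather than per attempt means repeated failed attempts on an already broken behavior do not lower the rate, which is one plausible reading of how a human red-teaming ASR could be compared against single-turn automated attacks.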

Code Implementations: 1 repo
