CR | AI | CL | LG | Feb 8, 2024

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

arXiv:2402.05668v3 | 115 citations | h-index: 17 | ACL
Originality: Incremental advance
AI Analysis

This work provides a comprehensive benchmark for assessing jailbreak attacks and defenses, helping the community avoid incremental research.

The paper evaluates 17 jailbreak attacks on nine aligned LLMs at scale and identifies patterns such as heuristic-based attacks achieving high attack success rates yet being easy for defenses to mitigate, which makes them impractical.

Jailbreak attacks aim to bypass LLMs' safeguards. While researchers have studied different jailbreak attacks in depth, they have done so in isolation, either under unaligned settings or by comparing only a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. We then conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easy for defenses to mitigate, which makes them impractical. Our study offers valuable insights for future research on jailbreak attacks and defenses. We hope our work helps the community avoid incremental work and serves as an effective benchmark tool for practitioners.

Code Implementations: 2 repos
