CR AI CL LG | Feb 8, 2024

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang

arXiv:2402.05668v3 | 33.0 | 120 citations | h-index: 17 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work provides a comprehensive benchmark for assessing jailbreak attacks and defenses, helping the community avoid incremental research.

The paper conducted a large-scale evaluation of 17 jailbreak attacks on nine aligned LLMs, revealing patterns like heuristic-based attacks having high success rates but low practicality against defenses.

Jailbreak attacks aim to bypass the safeguards of LLMs. While researchers have studied different jailbreak attacks in depth, they have done so in isolation, either under unaligned settings or by comparing only a limited range of methods. To fill this gap, we present a large-scale evaluation of jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. We then conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test the jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify several important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses, which makes them less practical. Our study offers valuable insights for future research on jailbreak attacks and defenses. We hope our work helps the community avoid incremental work and serves as an effective benchmark tool for practitioners.
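
The attack success rate mentioned in the abstract can be illustrated with a minimal sketch. This assumes the common definition (the fraction of forbidden questions for which an attack elicits a policy-violating answer) and a hypothetical record format of (attack, model, succeeded) tuples. The names `attack_success_rate`, `attack_a`, `attack_b`, and `model_x` and the toy data are illustrative only and are not taken from the paper, whose exact judging procedure is not described here.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {(attack, model): ASR} from per-question outcomes.

    `records` is an iterable of (attack, model, succeeded) tuples,
    one per forbidden question. This record format is an assumption
    made for illustration, not the paper's data layout.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for attack, model, succeeded in records:
        key = (attack, model)
        totals[key] += 1
        successes[key] += int(bool(succeeded))
    # ASR = successful attempts / total attempts for each (attack, model) pair.
    return {key: successes[key] / totals[key] for key in totals}


# Hypothetical toy outcomes; real evaluations would use many more questions.
records = [
    ("attack_a", "model_x", True),
    ("attack_a", "model_x", False),
    ("attack_a", "model_x", True),
    ("attack_a", "model_x", True),
    ("attack_b", "model_x", False),
    ("attack_b", "model_x", False),
]
asr = attack_success_rate(records)
```

Running the same computation on outcomes collected with and without a defense in place would give one way to compare how much each defense reduces a given attack's success rate.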

