CL, AI | Mar 18, 2024

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

arXiv:2403.12171v1 | 63 citations | h-index: 40
Originality: Synthesis-oriented
AI Analysis

EasyJailbreak addresses the lack of a standard framework for jailbreak attacks, enabling more comprehensive security evaluations for LLM researchers and developers. The contribution is incremental, since it builds on existing methods.

The paper introduces EasyJailbreak, a unified framework for constructing and evaluating jailbreak attacks on Large Language Models. Its evaluation across 10 LLMs reveals significant vulnerabilities: an average breach probability of 60%, with average attack success rates of 57% for GPT-3.5-Turbo and 33% for GPT-4.

Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, a package published on PyPI, a screencast video, and experimental outputs.
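To make the four-component decomposition concrete, the sketch below wires a Selector, Mutator, Constraint, and Evaluator into one attack loop and computes an Attack Success Rate over a set of queries. It is a toy illustration, not the EasyJailbreak API. Every name in it (`Candidate`, `selector`, `mutator`, `constraint`, `evaluator`, `toy_target`, `run_attack`, `attack_success_rate`) is hypothetical, the "model" is a stub that reacts to a placeholder tag, and the refusal check is a simple string match standing in for a real evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    prompt: str
    response: str = ""
    score: float = 0.0


def selector(pool: List[Candidate], k: int) -> List[Candidate]:
    """Selector: keep the k highest-scoring candidates as seeds."""
    return sorted(pool, key=lambda c: c.score, reverse=True)[:k]


def mutator(seed: Candidate, templates: List[str]) -> List[Candidate]:
    """Mutator: derive new candidates by wrapping the seed prompt in templates."""
    return [Candidate(t.format(prompt=seed.prompt)) for t in templates]


def constraint(cands: List[Candidate], max_len: int) -> List[Candidate]:
    """Constraint: discard candidates that break a rule (here, a length cap)."""
    return [c for c in cands if len(c.prompt) <= max_len]


def evaluator(cand: Candidate) -> float:
    """Evaluator: 1.0 if the response is not a refusal, else 0.0."""
    return 0.0 if cand.response.startswith("REFUSED") else 1.0


def toy_target(prompt: str) -> str:
    """Stub standing in for an LLM (assumption): 'complies' only on the [T2] tag."""
    return "OK" if "[T2]" in prompt else "REFUSED"


def run_attack(query: str, target: Callable[[str], str], templates: List[str],
               rounds: int = 2, k: int = 1, max_len: int = 80) -> bool:
    """Run select -> mutate -> constrain -> evaluate until success or budget ends."""
    pool = [Candidate(query)]
    for _ in range(rounds):
        seeds = selector(pool, k)
        new = [m for s in seeds for m in mutator(s, templates)]
        new = constraint(new, max_len)
        for c in new:
            c.response = target(c.prompt)
            c.score = evaluator(c)
        if any(c.score == 1.0 for c in new):
            return True
        pool = new or pool  # keep the old pool if every mutation was filtered out
    return False


def attack_success_rate(queries: List[str], target: Callable[[str], str],
                        templates: List[str]) -> float:
    """ASR: fraction of queries for which at least one candidate succeeded."""
    results = [run_attack(q, target, templates) for q in queries]
    return sum(results) / len(results)


if __name__ == "__main__":
    templates = ["[T1] {prompt}", "[T2] {prompt}"]
    print(attack_success_rate(["query-a", "query-b"], toy_target, templates))
```

Under this reading, the existing attack recipes differ mainly in which implementation fills each of the four slots, so swapping a single function changes the attack without touching the loop. The per-model ASR figures quoted above are fractions of this kind, measured against real models with the paper's own evaluators rather than a stub.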

Code Implementations: 1 repo
