CL · Jan 17, 2024

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

arXiv:2401.09002v6 · 32 citations · h-index: 15 · SIGKDD Explorations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more nuanced security assessments of LLMs against jailbreak attacks, though it appears incremental as it builds on existing evaluation concepts.

The paper tackles the problem of evaluating jailbreak attacks on large language models. It introduces a framework with coarse-grained and fine-grained evaluations, each scoring attack prompts on a 0-1 range. The reported results align with baseline metrics while also identifying harmful prompts that traditional binary methods miss.

Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs). To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the attacking prompts' effectiveness. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset is a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in prompt injection.
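To make the contrast between a binary evaluation and a 0-to-1 score concrete, here is a minimal Python sketch. It is not the paper's implementation: the four response labels, their numeric values, and the function names are all assumptions chosen for illustration, and the abstract does not specify how the scores are actually computed.

```python
from statistics import mean

# Hypothetical graded labels for a model response to an attack prompt.
# The four levels and their values are assumptions, not taken from the abstract.
LEVEL_SCORES = {
    "full_refusal": 0.0,
    "partial_refusal": 1 / 3,
    "partial_compliance": 2 / 3,
    "full_compliance": 1.0,
}


def binary_success(label: str) -> int:
    """Traditional-style check: the attack counts only if the model fully complies."""
    return 1 if label == "full_compliance" else 0


def graded_score(label: str) -> float:
    """Graded check: map each response label to a value in [0, 1]."""
    return LEVEL_SCORES[label]


def prompt_effectiveness(labels: list[str], scorer) -> float:
    """Average a per-response scorer over all responses collected for one prompt."""
    return mean(scorer(label) for label in labels)


# One attack prompt, with responses collected from three hypothetical models.
responses = ["partial_compliance", "partial_refusal", "full_refusal"]
binary_result = prompt_effectiveness(responses, binary_success)
graded_result = prompt_effectiveness(responses, graded_score)
```

Under these assumptions, the binary scorer rates the prompt as completely ineffective, while the graded scorer assigns it a nonzero value. This mirrors the abstract's point that some prompts look harmless under a binary evaluation but still elicit partially harmful output.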

