CL | AI | Aug 28, 2024

FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)

arXiv:2408.16163v2 | 3 citations | h-index: 5
AI Analysis

This paper addresses the need for more robust defenses against subtle multi-turn jailbreak attacks on LLMs, though it builds incrementally on an existing dataset (SORRY-Bench).

The paper evaluates LLM safety against multi-turn conversational attacks by introducing FRACTURED-SORRY-Bench, a framework that breaks harmful queries into seemingly innocuous sub-questions. It reports an increase of up to +46.22% in Attack Success Rate (ASR) across GPT models compared to baselines.

This paper introduces FRACTURED-SORRY-Bench, a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions. Our approach achieves a maximum increase of +46.22% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo models compared to baseline methods. We demonstrate that this technique poses a challenge to current LLM safety measures and highlights the need for more robust defenses against subtle, multi-turn attacks.
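
To make the reported metric concrete, here is a minimal sketch of how an Attack Success Rate and its increase over a baseline could be computed. It is not the authors' evaluation code. The function names, the boolean per-prompt judgments, and the toy data are all assumptions for illustration. The abstract does not state whether +46.22% is an absolute or a relative change; the sketch assumes an absolute difference in percentage points.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Percentage of prompts whose response was judged a successful attack.

    `judgments` is a hypothetical per-prompt verdict (True = the model
    complied with the harmful request); how verdicts are produced is
    outside the scope of this sketch.
    """
    if not judgments:
        raise ValueError("judgments must not be empty")
    return 100.0 * sum(judgments) / len(judgments)


def asr_increase(baseline: Sequence[bool], attacked: Sequence[bool]) -> float:
    """Absolute ASR change in percentage points (an assumed interpretation)."""
    return attack_success_rate(attacked) - attack_success_rate(baseline)


# Toy, made-up verdicts; these are not results from the paper.
baseline_verdicts = [False, False, True, False]
fractured_verdicts = [True, False, True, True]

print(f"baseline ASR:  {attack_success_rate(baseline_verdicts):.2f}%")
print(f"fractured ASR: {attack_success_rate(fractured_verdicts):.2f}%")
print(f"increase:      {asr_increase(baseline_verdicts, fractured_verdicts):+.2f} points")
```

Reporting the difference in percentage points keeps the number comparable across models with very different baseline ASRs; a relative change would need the baseline stated alongside it.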

