CR | AI | CL | LG | Feb 5, 2025

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

arXiv:2502.05223v1 | 5 citations | h-index: 11 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of scalable red-teaming for LLM safety. The advance is incremental, since it builds on existing jailbreaking techniques.

The paper tackles the high cost and impracticality of existing jailbreak attacks on LLMs. It proposes a Knowledge-Distilled Attacker (KDA), which distills the knowledge of an ensemble of attackers into a single model that generates diverse prompts automatically. The authors report higher attack success rates and better cost and time efficiency than existing methods.

Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.
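The abstract mentions a quantitative diversity analysis but does not name the metric used. As a minimal sketch only, and not the paper's method, the snippet below assumes one simple lexical measure: the mean pairwise Jaccard distance over whitespace tokens. The function names `jaccard_distance` and `mean_pairwise_diversity` are hypothetical and introduced here purely for illustration.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty texts are treated as identical.
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_diversity(texts: list[str]) -> float:
    """Average Jaccard distance over all unordered pairs of texts.

    Assumed metric for illustration; 0.0 means all texts share the same
    token set, 1.0 means no pair shares any token.
    """
    pairs = list(combinations(texts, 2))
    if not pairs:
        # Fewer than two texts: diversity is undefined, report 0.0.
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Placeholder strings stand in for generated text samples.
    samples = ["alpha beta gamma", "alpha beta delta", "epsilon zeta"]
    print(round(mean_pairwise_diversity(samples), 3))
```

A score near 0 would suggest a generator that keeps producing near-identical text, while a score near 1 would suggest little lexical overlap between outputs. Embedding-based or n-gram-based measures are common alternatives if lexical overlap is too coarse.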

