AI, Jan 14, 2025

Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

arXiv:2501.07959v25.81 citations, h-index: 1, Has Code

Originality: Incremental advance

AI Analysis

This work examines a security vulnerability in LLMs that is relevant to AI safety researchers. It is an incremental advance that builds on prior few-shot jailbreaking techniques.

The paper tackles the inefficiency of few-shot jailbreaking attacks on large language models, which can require long contexts against advanced models. It proposes Self-Instruct Few-Shot Jailbreaking, which decomposes the attack into pattern learning and behavior learning and replaces demo-level random search with demo-level greedy search. The authors report improved efficiency over baseline methods.

Recently, several works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, an approach known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3). In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more general and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.

