AI · Jan 14, 2025

Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

arXiv:2501.07959v21 citations · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses security vulnerabilities in LLMs and is aimed at AI safety researchers; it is an incremental advance that builds on prior few-shot jailbreaking techniques.

The paper tackles the inefficiency of few-shot jailbreaking attacks on large language models by proposing Self-Instruct Few-Shot Jailbreaking, which decomposes the attack into pattern and behavior learning and uses greedy search, achieving improved efficiency compared to baseline methods.

Recently, several studies have explored jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. improve the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, a method known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we observe that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3) \cite{llama3modelcard}. In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), which is facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.

Code Implementations: 1 repo