CL AI Sep 10, 2025

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

arXiv:2509.08729v36.72 citations h-index: 4 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, reproducible jailbreak probes in AI safety research. It is an incremental advance because it builds on prior M2S methods that relied on manually written templates.

The paper automates the discovery of multi-turn to single-turn jailbreak templates for red-teaming language models, reaching a 44.8% success rate (103/230) on GPT-4.1 with an evolutionary framework.

Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
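
The two evaluation quantities mentioned in the abstract, success rate at a threshold and the coupling between prompt length and judge score, can be illustrated with a minimal sketch. This is not the authors' code: the function names `success_rate` and `pearson` are hypothetical, the inclusive comparison (`score >= threshold`) is an assumption, and the score lists are synthetic placeholders chosen only to reproduce the 103/230 arithmetic.

```python
from math import sqrt

def success_rate(scores, threshold=0.70):
    """Fraction of trials whose judge score meets the threshold.

    Assumption: a trial counts as a success when score >= threshold;
    the abstract does not say whether the comparison is inclusive.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    hits = sum(1 for s in scores if s >= threshold)
    return hits / len(scores)

def pearson(xs, ys):
    """Pearson correlation, e.g. between prompt length and judge score."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Synthetic scores: 103 trials at the threshold, 127 below it.
synthetic_scores = [0.70] * 103 + [0.0] * 127
rate = success_rate(synthetic_scores)
print(f"{rate:.1%}")
```

A positive `pearson(lengths, scores)` on real trial logs would correspond to the length-score coupling the abstract reports; a length-aware judge would need to account for that dependence before comparing templates.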
