Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
This work addresses security vulnerabilities in frontier AI models, though it is incremental as it focuses on dataset creation rather than novel defense methods.
The paper tackles the problem of jailbreak attacks on large language models by introducing a dataset with both single-turn and multi-turn formats, showing that content-equivalent inputs have different jailbreak success rates and that defenses are structure-dependent.
Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we introduce a dataset of jailbreaks where each example can be input in both a single or a multi-turn format. We show that while equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other. Similarly, LLM-based filter guardrails also perform differently depending on not just the input content but the input structure. Thus, vulnerabilities of frontier models should be studied in both single and multi-turn settings; this dataset provides a tool to do so.