LG AI
Nov 24, 2025

Automating Deception: Scalable Multi-Turn LLM Jailbreaks

arXiv:2511.19517v1 (5 citations)

Originality: Incremental advance

AI Analysis

This work addresses a critical safety issue for LLM developers and users by exposing vulnerabilities in current safety architectures. It is incremental, however, because it builds on existing psychological principles for attack generation.

The paper tackles multi-turn conversational attacks on LLMs by introducing an automated pipeline that generates large-scale jailbreak datasets. It finds that GPT-family models are significantly vulnerable to conversational history, with Attack Success Rates rising by up to 32 percentage points, while Gemini 2.5 Flash is nearly immune.

Multi-turn conversational attacks pose a persistent threat to Large Language Models (LLMs). They bypass safety alignment by leveraging psychological principles such as Foot-in-the-Door (FITD), in which a small initial request paves the way for a more significant one. Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google's Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic's Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.
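The headline comparison in the abstract is a difference in Attack Success Rate between the with-history and without-history conditions, expressed in percentage points. The sketch below shows one plausible way to compute that quantity. The `Trial` record, the function names, and the toy data are all hypothetical and are not taken from the paper; how a response is judged as a successful attack is not specified here and is simply assumed to be a boolean label.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Trial:
    """One evaluated scenario (hypothetical record layout, not the paper's)."""
    scenario_id: str
    with_history: bool       # True = multi-turn condition, False = single-turn
    attack_succeeded: bool   # assumed to come from some external judge


def attack_success_rate(trials: Iterable[Trial]) -> float:
    """Return ASR as a percentage (0 to 100) over the given trials."""
    trials = list(trials)
    if not trials:
        raise ValueError("cannot compute ASR over zero trials")
    successes = sum(t.attack_succeeded for t in trials)
    return 100.0 * successes / len(trials)


def history_effect(trials: Iterable[Trial]) -> float:
    """ASR(with history) minus ASR(without history), in percentage points."""
    trials = list(trials)
    multi: List[Trial] = [t for t in trials if t.with_history]
    single: List[Trial] = [t for t in trials if not t.with_history]
    return attack_success_rate(multi) - attack_success_rate(single)


# Toy data for illustration only; these are NOT results from the paper.
toy_trials = [
    Trial("s1", True, True),
    Trial("s2", True, True),
    Trial("s3", True, True),
    Trial("s4", True, False),
    Trial("s1", False, True),
    Trial("s2", False, False),
    Trial("s3", False, False),
    Trial("s4", False, False),
]

multi_asr = attack_success_rate(t for t in toy_trials if t.with_history)
single_asr = attack_success_rate(t for t in toy_trials if not t.with_history)
delta_pp = history_effect(toy_trials)
print(f"multi-turn ASR:  {multi_asr:.1f}%")
print(f"single-turn ASR: {single_asr:.1f}%")
print(f"history effect:  {delta_pp:+.1f} percentage points")
```

Reporting the gap in percentage points (a subtraction of two rates) rather than as a relative percentage keeps the figure comparable across models whose single-turn baselines differ.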
