CL | AI | CR | Feb 16, 2025

Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

arXiv:2502.11054v4 | 55 citations | h-index: 15 | Has Code | EMNLP
Originality: Highly original
AI Analysis

This paper addresses critical safety vulnerabilities in LLMs and is relevant to AI security researchers, though it is incremental in that it builds on existing multi-turn attack methods.

The paper studies multi-turn jailbreak attacks on large language models and proposes Reasoning-Augmented Conversation (RACE), which reformulates harmful queries into benign reasoning tasks. It reports state-of-the-art attack effectiveness, with attack success rates increasing by up to 96% and reaching 82% and 92% against OpenAI o1 and DeepSeek R1, respectively.

Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, respectively, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.

Code Implementations: 1 repo
