CLMay 31, 2025

SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

Martin Kuo, Jianyi Zhang, Aolin Ding, Louis DiValentin, Amin Hass, Benjamin F Morris, Isaac Jacobson, Randolph Linderman, James Kiessling, Nicolas Ramos, Bhavna Gopal, Maziyar Baran Pouyan

arXiv:2506.00668v113.07 citationsh-index: 15

Originality Incremental advance

AI Analysis

This addresses safety risks in LLMs for society by defending against multi-turn attacks, representing a strong specific gain in a domain-specific area.

The paper tackles the problem of malicious attackers exploiting large language models (LLMs) in multi-turn dialogues to achieve harmful objectives, proposing STREAM, a defense mechanism that reduces the Attack Success Rate by 51.2% while preserving LLM capabilities.

Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.

View on arXiv PDF

Similar