MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
This work addresses the challenge of keeping LLMs secure in multi-turn interactions; it is an incremental advance in safety alignment.
The paper tackles the problem of securing large language models (LLMs) against malicious intentions hidden in multi-round dialogues. It proposes the MTSA framework, which combines thought-guided attack learning with adversarial iterative optimization. The authors report state-of-the-art attack capabilities for the red-team model and significant safety improvements for the target model on safety benchmarks.
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. In multi-round dialogues, however, malicious intentions may be hidden across the interaction, making LLMs more prone to producing harmful responses. In this paper, we propose the \textbf{M}ulti-\textbf{T}urn \textbf{S}afety \textbf{A}lignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages. In the thought-guided attack learning stage, the red-team model learns thought-guided multi-round jailbreak attacks so that it can generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities by interacting with each other. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
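The abstract does not spell out how the future rewards are computed. As a minimal sketch of one common reading, the Python snippet below assigns each dialogue turn the discounted sum of the rewards from that turn onward. The function name `discounted_future_rewards`, the discount factor `gamma`, and the per-turn scores are all assumptions made for illustration, not the paper's actual formulation.

```python
from typing import List


def discounted_future_rewards(turn_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """For each turn t, return the sum over k >= t of gamma**(k - t) * r_k.

    Hypothetical helper: the discounting scheme is an assumption,
    not the algorithm described in the paper.
    """
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    # Walk backwards so each turn reuses the return of the turn after it.
    for t in range(len(turn_rewards) - 1, -1, -1):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical per-turn scores for a three-turn dialogue in which
# only the final turn receives a nonzero reward.
rewards = [0.0, 0.0, 1.0]
returns = discounted_future_rewards(rewards, gamma=0.5)
print(returns)
```

Under this reading, earlier turns receive credit for an outcome that only becomes visible later in the dialogue, which is the situation the abstract describes when malicious intent is spread over several rounds.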