CL · AI · Sep 18, 2025

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Siyu Yan, Long Zeng, Xuecheng Wu, Chengcheng Han, Kongcheng Zhang, Chong Peng, Xuezhi Cao, Xunliang Cai, Chenjuan Guo

arXiv:2509.14651v113.06 citations · h-index: 6 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses a critical safety issue for users of large language models in real-world multi-turn dialogues. It is an incremental advance that targets a specific vulnerability beyond single-turn attacks.

The paper tackles the problem of multi-turn jailbreaks in large language models, where adversaries exploit conversational context to bypass safety measures, and introduces MUSE, a framework that effectively identifies and mitigates these vulnerabilities through attack and defense methods.

As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.
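The title and abstract refer to Monte Carlo tree search (MCTS) as the heuristic search behind MUSE-A. For readers unfamiliar with the technique, the block below is a minimal, generic MCTS sketch with UCT selection over abstract action labels. It is not the authors' implementation and does not model frame semantics or any dialogue content; `Node`, `uct_score`, `mcts`, `best_path`, and the `reward_fn` callback are hypothetical names chosen for illustration, and the exploration constant `c=1.4` is an assumed default.

```python
import math
import random


class Node:
    """One node in the search tree; `state` is the tuple of actions taken so far."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}       # action -> Node
        self.visits = 0
        self.total_reward = 0.0


def uct_score(child, parent_visits, c=1.4):
    """UCT: mean reward plus an exploration bonus that shrinks with visits."""
    if child.visits == 0:
        return float("inf")
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def mcts(root_state, actions, reward_fn, max_depth, iterations, seed=0):
    """Run generic MCTS; `reward_fn` (hypothetical) scores a full-depth state."""
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and not terminal.
        while len(node.state) < max_depth and len(node.children) == len(actions):
            node = max(node.children.values(),
                       key=lambda ch: uct_score(ch, node.visits))
        # Expansion: add one untried child if the node is not terminal.
        if len(node.state) < max_depth:
            untried = [a for a in actions if a not in node.children]
            action = rng.choice(untried)
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # Simulation: random rollout to the depth limit.
        state = node.state
        while len(state) < max_depth:
            state = state + (rng.choice(actions),)
        reward = reward_fn(state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return root


def best_path(root):
    """Follow the most-visited child at each level."""
    path, node = [], root
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        path.append(action)
    return path
```

In this sketch the search budget concentrates on branches whose rollouts score well, while the exploration term keeps less-visited branches from being abandoned entirely. How MUSE-A defines its actions and scoring is described in the paper and its repository, not here.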
