CL | AI | Feb 27, 2024

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue

arXiv:2402.17262v2 | 70 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This exposes a critical safety issue for users of LLMs in interactive settings: in multi-turn dialogue, harmful content can be elicited incrementally, one turn at a time.

The paper tackles the safety vulnerability of Large Language Models (LLMs) in multi-turn dialogue. It shows that decomposing an unsafe query into sub-queries can induce LLMs to generate harmful responses, and its experiments across various LLMs reveal inadequacies in their safety mechanisms.

Large Language Models (LLMs) have been shown to generate illegal or unethical responses, particularly when subjected to "jailbreak" attacks. Research on jailbreaks has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans obtain information from LLMs. In this paper, we argue that humans could exploit multi-turn dialogue to induce LLMs into generating harmful information. LLMs may not reject cautionary or borderline unsafe queries, even when every turn of a multi-turn dialogue serves a single malicious purpose. Therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induced LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, indicate current inadequacies in the safety mechanisms of LLMs in multi-turn dialogue. Our findings expose vulnerabilities of LLMs in complex scenarios involving multi-turn dialogue, presenting new challenges for the safety of LLMs.
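The gap described in the abstract can be illustrated with a toy sketch. This is not the paper's method or evaluation: the risk scores, the threshold, and the noisy-OR aggregation are all assumptions made up for illustration. The sketch only shows how a check that judges each turn in isolation can pass every sub-query, while a check that aggregates over the whole dialogue flags the same conversation.

```python
from math import prod
from typing import List

# Hypothetical refusal threshold; a real system would not expose a single number.
THRESHOLD = 0.5


def per_turn_refusals(turn_risks: List[float], threshold: float = THRESHOLD) -> List[bool]:
    """Judge each turn in isolation: True means that turn would be refused."""
    return [risk >= threshold for risk in turn_risks]


def cumulative_risk(turn_risks: List[float]) -> float:
    """Aggregate per-turn risks with a noisy-OR (an assumed, illustrative rule).

    The result is the chance that at least one turn contributes to harm,
    treating the per-turn scores as independent probabilities.
    """
    return 1.0 - prod(1.0 - risk for risk in turn_risks)


def conversation_refused(turn_risks: List[float], threshold: float = THRESHOLD) -> bool:
    """Judge the dialogue as a whole rather than turn by turn."""
    return cumulative_risk(turn_risks) >= threshold


# Hypothetical scores for three borderline sub-queries that together
# serve one purpose; each one sits below the per-turn threshold.
sub_query_risks = [0.30, 0.40, 0.35]

if __name__ == "__main__":
    print("per-turn refusals:", per_turn_refusals(sub_query_risks))
    print("cumulative risk:", round(cumulative_risk(sub_query_risks), 3))
    print("conversation refused:", conversation_refused(sub_query_risks))
```

With these made-up scores, no individual turn reaches the threshold, yet the aggregated score does. That is the shape of the vulnerability the abstract describes: borderline sub-queries are answered one by one even though the dialogue as a whole is unsafe.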
