CL · AI · Oct 14, 2024

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

arXiv:2410.10700v2 · 64 citations · h-index: 11 · Has Code · ACL
Originality: Highly original
AI Analysis

This paper addresses safety gaps in LLMs that matter to both users and developers, exposing vulnerabilities through a novel attack method; its proposal to expand safety training is a more incremental contribution.

The paper identifies a new safety vulnerability in large language models (LLMs): under natural distribution shifts, benign-looking prompts that are semantically related to harmful content can bypass safety mechanisms. It introduces ActorBreaker, an attack method that outperforms existing methods in diversity, effectiveness, and efficiency. Fine-tuning on a dataset constructed with ActorBreaker improves robustness, with some trade-offs in utility.

Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.

Code Implementations: 2 repos
