CL, AI, CR, LG, SE | Mar 12, 2024

CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

arXiv:2403.07865v5 | 75 citations | h-index: 13 | ACL
Originality: Incremental advance
AI Analysis

This work identifies a new safety risk in how LLMs handle code inputs, a vulnerability that could lead to misuse. It is incremental in that it examines one specific domain rather than proposing a broad solution.

The paper tackles the problem of safety generalization in large language models (LLMs) by introducing CodeAttack, a framework that transforms natural language inputs into code to test vulnerabilities, revealing that it bypasses safety guardrails over 80% of the time across models like GPT-4 and Claude-2.

The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural languages, which may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a new and universal safety vulnerability of these models against code input: CodeAttack bypasses the safety guardrails of all models more than 80% of the time. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, such as encoding natural language input with data structures. Furthermore, we give our hypotheses about the success of CodeAttack: the misaligned bias acquired by LLMs during code training, prioritizing code completion over avoiding the potential safety risk. Finally, we analyze potential mitigation measures. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.

Code Implementations: 2 repos
