CL · Jul 29, 2024

Can Editing LLMs Inject Harm?

arXiv:2407.20224v4 · 26 citations · h-index: 20
Originality: Highly original
AI Analysis

This work highlights a new safety threat for LLMs, showing how editing techniques can compromise alignment and spread harmful content, which is a critical concern for AI security and ethics.

The paper investigates whether knowledge editing can bypass safety alignment in LLMs to inject harmful information, finding that editing attacks effectively inject misinformation and bias, with high stealthiness and significant impacts on fairness.

Large Language Models (LLMs) have emerged as a new information channel. Meanwhile, one critical but under-explored question is: is it possible to bypass safety alignment and stealthily inject harmful information into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset, EditAttack. Specifically, we focus on two typical safety risks of Editing Attack: Misinformation Injection and Bias Injection. For the first risk, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, and that effectiveness is particularly high for commonsense misinformation. For the second risk, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but a single biased sentence injection can also degrade overall fairness. We further illustrate the high stealthiness of editing attacks. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques for compromising the safety alignment of LLMs, and the feasibility of disseminating misinformation or bias with LLMs as new channels.

Code Implementations: 1 repo