CR AI CL LG · Jan 29, 2025

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Georgia Tech
arXiv:2501.17433v1 · 30 citations · h-index: 14 · Has Code
Originality: Highly original
AI Analysis

This work highlights a critical security vulnerability for LLM developers and users: it demonstrates that guardrail moderation alone is insufficient against adversarially modified fine-tuning data.

The paper addresses harmful fine-tuning attacks on Large Language Models (LLMs). It shows that guardrail moderation, used to filter harmful data before fine-tuning, is unreliable: the proposed Virus attack bypasses the guardrail with a leakage ratio of up to 100% while still achieving superior attack performance.

Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with a leakage ratio of up to 100%, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that it is reckless to treat guardrail moderation as a straw to clutch at against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
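
To make the "leakage ratio" figure concrete, here is a minimal sketch of how such a metric could be computed, assuming it means the fraction of known-harmful samples that a guardrail fails to flag. The function names and the keyword-based `toy_guardrail` are hypothetical stand-ins for illustration only; they are not taken from the paper or its repository.

```python
from typing import Callable, Iterable


def leakage_ratio(
    harmful_samples: Iterable[str],
    flags_as_harmful: Callable[[str], bool],
) -> float:
    """Fraction of known-harmful samples that the guardrail does NOT flag.

    Assumed definition: leaked samples / total harmful samples.
    """
    samples = list(harmful_samples)
    if not samples:
        raise ValueError("need at least one sample")
    leaked = sum(1 for s in samples if not flags_as_harmful(s))
    return leaked / len(samples)


def toy_guardrail(text: str) -> bool:
    """Hypothetical guardrail: flags any text containing a blocked keyword."""
    return "blocked" in text.lower()


# Placeholder strings standing in for a labeled evaluation set.
samples = [
    "blocked request A",
    "rephrased request B",
    "BLOCKED request C",
    "rephrased request D",
]

ratio = leakage_ratio(samples, toy_guardrail)
print(f"leakage ratio: {ratio:.0%}")
```

Under this assumed definition, a leakage ratio of 100% means every harmful sample passed the filter, which is why a defender would want to track it alongside downstream safety evaluations rather than rely on the filter alone.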

Code Implementations: 1 repo