Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
Vaccine addresses a critical security problem for LLM providers and users: it mitigates the alignment-breaking effect of harmful data uploaded by users, and represents an incremental improvement in defense mechanisms.
The paper tackles the security risk of harmful fine-tuning attacks on Large Language Models (LLMs) by proposing Vaccine, a perturbation-aware alignment technique that boosts robustness against harmful prompts while preserving reasoning ability for benign ones, with results demonstrated on models like Llama2, Opt, and Vicuna.
The new paradigm of finetuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the finetuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, which points to a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user finetuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against embedding drift induced by harmful prompts while preserving reasoning ability on benign prompts. Our code is available at \url{https://github.com/git-disl/Vaccine}.
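To make the core idea concrete, the following is a minimal sketch of one perturbation-aware alignment step on a toy two-layer linear model. It is an illustration under stated assumptions, not the released implementation. The abstract does not specify how the perturbation is crafted, so the sketch assumes a norm-bounded perturbation along the loss gradient with respect to the hidden embedding. The names `perturbation_aware_step`, `crafted_perturbation`, `rho`, and `lr` are hypothetical, as is the squared-error loss.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def embed(W, x):
    # Hidden embedding e = W x of the toy model.
    return [dot(row, x) for row in W]

def crafted_perturbation(grad_e, rho):
    # Assumption: the perturbation points along the loss gradient w.r.t.
    # the embedding and is rescaled to have L2 norm rho.
    norm = math.sqrt(dot(grad_e, grad_e))
    if norm == 0.0:
        return [0.0 for _ in grad_e]
    return [rho * g / norm for g in grad_e]

def perturbation_aware_step(W, v, x, y, rho=0.1, lr=0.05):
    # Toy model: prediction = v . (W x), loss = 0.5 * (prediction - y)^2.
    # Pass 1: gradient of the clean loss w.r.t. the hidden embedding.
    e = embed(W, x)
    err = dot(v, e) - y
    grad_e = [err * vi for vi in v]
    eps = crafted_perturbation(grad_e, rho)

    # Pass 2: add the perturbation to the embedding (treated as a constant)
    # and take the weight gradients of the perturbed loss.
    e_pert = [ei + di for ei, di in zip(e, eps)]
    err_pert = dot(v, e_pert) - y
    grad_W = [[err_pert * vi * xj for xj in x] for vi in v]
    grad_v = [err_pert * ei for ei in e_pert]

    # Plain gradient-descent update using the perturbed-loss gradients.
    new_W = [[w - lr * g for w, g in zip(row, g_row)]
             for row, g_row in zip(W, grad_W)]
    new_v = [vi - lr * g for vi, g in zip(v, grad_v)]
    return new_W, new_v, eps

# Hypothetical alignment sample: input x with target y.
W0 = [[1.0, 0.0], [0.0, 1.0]]
v0 = [1.0, 1.0]
W1, v1, eps = perturbation_aware_step(W0, v0, x=[1.0, 2.0], y=0.0)
```

Repeating this step over the alignment data is one reading of "progressively adding crafted perturbation": each update fits the model under a worst-case shift of its embeddings within a radius `rho`, so a later drift of similar size should change the output less.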