Badllama 3: removing safety finetuning from Llama 3 in minutes
The finding exposes a critical weakness in LLM safety mechanisms: safety fine-tuning can be removed by anyone who holds the model weights, which poses risks for deployments that depend on it.
The authors demonstrate that extensive safety fine-tuning in large language models such as Llama 3 is easily subverted by an attacker with access to the model weights: on a single GPU, they remove it from the 8B model in one minute and from the 70B model in 30 minutes.
We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.