LG AI CL CR
Jul 1, 2024

Badllama 3: removing safety finetuning from Llama 3 in minutes

arXiv:2407.01376v1, 9 citations, h-index: 2

Originality: Incremental advance

AI Analysis

The paper exposes a critical weakness in LLM safety mechanisms: safety fine-tuning offers little protection once an attacker has the model weights, which poses risks for deployment in secure or ethical applications.

The authors demonstrate that extensive safety fine-tuning in large language models such as Llama 3 is easily subverted by an attacker with access to the model weights: on a single GPU, safety fine-tuning is stripped from the 8B model in one minute and from the 70B model in 30 minutes.

We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.
