CR AI LG · Feb 26, 2025

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy Dvijotham

arXiv:2502.19537v5 · 18.56 citations · h-index: 36 · Has Code

Originality: Highly original

AI Analysis

This work exposes vulnerabilities in the production fine-tuning APIs offered by major providers such as OpenAI and Anthropic, and it undermines the assumption that a model is safe because it initially refuses harmful requests.

The paper shows how to bypass the safety mechanisms that guard fine-tuning by introducing a 'refuse-then-comply' attack, which fine-tunes a model to first refuse a harmful request and then answer it anyway. The attack achieves success rates of 57% against GPT-4o and 72% against Claude Haiku.

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To prevent abuse, these providers apply filters to block fine-tuning on overtly harmful data. In this setting, we make three contributions: First, while past work has shown that safety alignment is "shallow", we correspondingly demonstrate that existing fine-tuning attacks are shallow -- attacks target only the first several tokens of the model response, and consequently can be blocked by generating the first several response tokens with an aligned model. Second, we conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Third, we demonstrate the potency of our new fine-tuning attack by jailbreaking both open-source models equipped with defenses and production models, achieving attack success rates of 57% and 72% against GPT-4o and Claude Haiku, respectively. Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic. Our work undermines the notion that models are safe because they initially refuse harmful requests and broadens awareness of the scope of attacks that face production fine-tuning APIs.
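The shallow defense mentioned in the abstract, generating the first several response tokens with an aligned model, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `NextToken` interface, the `scripted` toy models, and the `prefix_len` parameter are hypothetical and are not taken from the paper or its code. The toy models emit placeholder tokens by position only, whereas a real model would condition on the prompt and on the prefix.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps (prompt, tokens so far) to the next token.
NextToken = Callable[[str, List[str]], str]

EOS = "<eos>"


def generate_with_aligned_prefix(
    prompt: str,
    aligned: NextToken,
    finetuned: NextToken,
    prefix_len: int = 5,
    max_tokens: int = 20,
) -> List[str]:
    """Decode the first `prefix_len` tokens with the aligned model,
    then hand the rest of the response to the fine-tuned model."""
    tokens: List[str] = []
    for step in range(max_tokens):
        model = aligned if step < prefix_len else finetuned
        token = model(prompt, tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def scripted(script: List[str]) -> NextToken:
    """Toy stand-in for a model: emits a fixed script by position.
    A real model would condition on the prompt and the tokens so far."""
    def model(prompt: str, so_far: List[str]) -> str:
        i = len(so_far)
        return script[i] if i < len(script) else EOS
    return model


# Placeholder tokens only: a* come from the aligned model, f* from the fine-tuned one.
aligned_model = scripted(["a0", "a1", "a2", "a3"])
finetuned_model = scripted(["f0", "f1", "f2", "f3", "f4"])

response = generate_with_aligned_prefix(
    "example prompt", aligned_model, finetuned_model, prefix_len=2
)
print(response)
```

In this sketch the defense controls only the first `prefix_len` positions. That is the property the abstract calls shallow: an attack whose effect appears after those positions is not touched by the aligned prefix.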
