Self-Mined Hardness for Safety Fine-Tuning

arXiv:2605.0322667.2

Predicted impact: top 27% in LG · last 90 days
Originality: Incremental advance

AI Analysis

This work targets lower attack success rates in safety fine-tuning for LLMs. It achieves them, but the sharp rise in refusals on benign prompts, and the trade-off required to bring those refusals back down, mark it as an incremental advance over existing methods.

Self-mined hardness scores each candidate prompt by the target model's own jailbreak rate and fine-tunes on the hardest prompts. This reduces the WildJailbreak attack success rate (ASR) from 11.5% to 1-3% on Llama-3-8B-Instruct and from 20.1% to 1-3% on Llama-3.2-3B-Instruct, but raises refusal on benign jailbreak-shaped prompts from 14-22% to 74-94%. Interleaving the hard prompts with benign prompts brings refusal back down to 30-51% (8B) and 52-72% (3B), at the cost of a 2-6 percentage point increase in ASR.

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
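The selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` and `is_harmful` callables, the rollout count, the eligibility rule (at least one harmful and at least one non-harmful rollout), and the pairing of benign prompts with ready-made responses are all assumptions introduced here, since the abstract does not specify them.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ScoredPrompt:
    prompt: str
    hardness: float           # fraction of the model's rollouts judged harmful
    safe_rollouts: List[str]  # rollouts the judge did not flag


def score_prompt(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: samples one rollout
    is_harmful: Callable[[str, str], bool],  # hypothetical: judge verdict
    n_rollouts: int = 8,                     # assumed value, not from the paper
) -> ScoredPrompt:
    """Score a prompt by how often the target model's own rollouts are harmful."""
    rollouts = [generate(prompt) for _ in range(n_rollouts)]
    flags = [is_harmful(prompt, r) for r in rollouts]
    safe = [r for r, flagged in zip(rollouts, flags) if not flagged]
    return ScoredPrompt(prompt, sum(flags) / n_rollouts, safe)


def select_hardest(scored: List[ScoredPrompt], fraction: float = 0.5) -> List[ScoredPrompt]:
    """Keep the hardest `fraction` of the eligible pool.

    Assumed eligibility rule: the prompt jailbroke the model at least once
    and also has at least one non-jailbroken rollout to train on.
    """
    eligible = [s for s in scored if s.hardness > 0 and s.safe_rollouts]
    eligible.sort(key=lambda s: s.hardness, reverse=True)
    if not eligible:
        return []
    k = max(1, int(len(eligible) * fraction))
    return eligible[:k]


def build_mixed_dataset(
    hard: List[ScoredPrompt],
    benign_pairs: List[Tuple[str, str]],  # assumed (prompt, helpful response) pairs
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Interleave hard and benign examples 1:1.

    Each hard prompt is paired with one of the model's own safe rollouts.
    zip() stops at the shorter list, so the mix stays exactly 1:1.
    """
    examples: List[Tuple[str, str]] = []
    for scored, benign in zip(hard, benign_pairs):
        examples.append((scored.prompt, rng.choice(scored.safe_rollouts)))
        examples.append(benign)
    return examples
```

Under these assumptions, replacing the sort in `select_hardest` with a random shuffle would give the "random half" baseline that the abstract compares against.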
