An Embarrassingly Simple Defense Against LLM Abliteration Attacks
This work addresses a security vulnerability in LLM alignment that matters to users who rely on safe content generation, though it is an incremental improvement over existing fine-tuning methods.
The paper tackles abliteration attacks on large language models, which suppress refusal behavior so that the model generates harmful content. It proposes a defense that fine-tunes models on an extended-refusal dataset, distributing the refusal signal across multiple tokens. Under attack, refusal rates of the fine-tuned models drop by at most 10%, compared to 70-80% drops in baselines.
Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
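To make the attack described above concrete, the following is a minimal toy sketch of suppressing a single latent direction. It is not the paper's implementation. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, and that suppression means projecting that direction out of each activation vector. The activations are synthetic Gaussian vectors, and the names `refusal_direction`, `ablate_direction`, and `sample` are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of class means (an assumed estimator)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def ablate_direction(act, direction):
    """Remove the component of `act` that lies along the unit vector `direction`."""
    coeff = dot(act, direction)
    return [a - coeff * r for a, r in zip(act, direction)]


# Synthetic stand-in for model activations: harmful prompts are shifted
# along coordinate 0, which plays the role of the refusal direction.
random.seed(0)
DIM = 16


def sample(shift):
    v = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    v[0] += shift
    return v


harmless = [sample(0.0) for _ in range(200)]
harmful = [sample(4.0) for _ in range(200)]

r_hat = refusal_direction(harmful, harmless)
ablated = [ablate_direction(a, r_hat) for a in harmful]

# After ablation, no activation retains a component along r_hat.
max_residual = max(abs(dot(a, r_hat)) for a in ablated)
```

In this toy setting the whole refusal signal lies along one direction, so a single projection removes it. The defense described in the abstract aims to break that assumption by spreading the refusal signal across multiple token positions. The sketch does not model that effect.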