CR AI May 17

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

arXiv:2605.17413

AI Analysis

For security practitioners and AI safety researchers, this work provides a systematic evaluation of alignment removal methods, showing that compliance alone indicates neither competence nor safe deployment.

The paper introduces a controlled protocol for evaluating alignment removal in language models on authorized cybersecurity tasks. Task-only LoRA adaptation reaches a mean security score of 0.87 with unsafe compliance of 0.13, while single-vector refusal projection raises unsafe compliance from 0.10 to 0.47 and lifts the mean security score only from 0.46 to 0.50.

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.
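The utility-risk framing in the abstract can be illustrated with a minimal sketch that keeps only the non-dominated methods among the reported (mean security score, unsafe compliance) pairs. This is not the authors' code. The `Method` type, the function names, and the label "aligned baseline" are hypothetical, and the sketch assumes that the "aligned spillover rate" matched by the rank-4 projection is the 0.10 unsafe compliance quoted for the aligned model. Refusal-suppression with retention is left out because the abstract gives its spillover (0.27) but not its security score.

```python
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    security: float  # mean security score, higher is better
    unsafe: float    # out-of-scope unsafe compliance, lower is better


# Figures quoted in the abstract. The rank-4 unsafe value assumes
# "aligned spillover rate" means the aligned model's 0.10.
REPORTED = [
    Method("aligned baseline", 0.46, 0.10),
    Method("single-vector refusal projection", 0.50, 0.47),
    Method("rank-4 refusal-subspace projection", 0.51, 0.10),
    Method("task-only LoRA", 0.87, 0.13),
]


def dominates(a: Method, b: Method) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.security >= b.security and a.unsafe <= b.unsafe
    strictly_better = a.security > b.security or a.unsafe < b.unsafe
    return no_worse and strictly_better


def pareto_frontier(methods: list[Method]) -> list[Method]:
    """Keep the methods that no other method dominates."""
    return [m for m in methods if not any(dominates(o, m) for o in methods)]


frontier = pareto_frontier(REPORTED)
for m in frontier:
    print(f"{m.name}: security={m.security:.2f}, unsafe={m.unsafe:.2f}")
```

Under these assumptions, single-vector projection is dominated by the rank-4 projection: it has a lower security score and much higher unsafe compliance. That is consistent with the abstract's point that more compliance is not the same as more competence.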
