CR | AI | CL | Aug 27, 2025

Safety Alignment Should Be Made More Than Just A Few Attention Heads

arXiv:2508.19697v1 | 3 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses safety vulnerabilities in large language models that affect both users and developers. It represents an incremental improvement in safety alignment techniques.

The paper tackles the vulnerability of LLM safety alignment to adversarial attacks by showing that safety mechanisms depend on a limited subset of attention heads, and proposes AHD training to distribute safety behaviors across more heads, resulting in stronger safety robustness against jailbreak attacks while maintaining utility.

Current safety alignment for large language models (LLMs) continues to present vulnerabilities, given that adversarial prompting can effectively bypass their safety measures. Our investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety. To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model's refusal direction to pinpoint attention heads mostly responsible for safety behaviors. Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads. To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention heads. Experimental results demonstrate that AHD successfully distributes safety-related capabilities across more attention heads. Moreover, evaluations under several mainstream jailbreak attacks show that models trained with AHD exhibit considerably stronger safety robustness, while maintaining overall functional utility.
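The abstract does not spell out how RDSHA scores heads, so the following is only a minimal sketch of the general idea of refusal-direction-guided head ablation, not the authors' implementation. It assumes that the refusal direction is the normalized difference between mean activations on harmful and harmless prompts, and that each head is scored by how far its mean output projects onto that direction. All function names, head identifiers, and the toy two-dimensional activations are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(harmful_acts, harmless_acts):
    """ASSUMPTION: unit vector from the harmless mean to the harmful mean."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def score_heads(head_outputs, direction):
    """ASSUMPTION: a head's score is the projection of its mean output
    (over harmful prompts) onto the refusal direction."""
    return {head: dot(mean_vector(outs), direction)
            for head, outs in head_outputs.items()}

def ablate(head_outputs, heads):
    """Zero out the outputs of the selected heads; leave the rest untouched."""
    return {head: ([[0.0] * len(v) for v in outs] if head in heads else outs)
            for head, outs in head_outputs.items()}

def residual_projection(head_outputs, direction):
    """Total projection of all head outputs onto the refusal direction."""
    return sum(score_heads(head_outputs, direction).values())

# Toy data (hypothetical): 2-dimensional activations, two prompts per class.
harmful_acts = [[2.0, 0.0], [2.0, 0.0]]
harmless_acts = [[0.0, 0.0], [0.0, 0.0]]
# Keys are hypothetical (layer, head) identifiers.
head_outputs = {
    (0, 0): [[1.5, 0.1], [1.7, -0.1]],
    (0, 1): [[0.1, 0.9], [0.2, 1.1]],
    (1, 0): [[0.3, 0.0], [0.1, 0.0]],
}

direction = refusal_direction(harmful_acts, harmless_acts)
scores = score_heads(head_outputs, direction)
ranking = sorted(scores, key=scores.get, reverse=True)

before = residual_projection(head_outputs, direction)
after = residual_projection(ablate(head_outputs, {ranking[0]}), direction)
print(ranking[0], round(before, 3), round(after, 3))
```

In this toy setup a single head carries most of the projection onto the refusal direction, so ablating it removes most of the signal. That is the kind of concentration the abstract describes, and the kind of dependence AHD is meant to reduce by spreading safety behavior across more heads.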

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
