CR · AI · Mar 6

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

arXiv:2603.05772v1 · 1 citation · Has Code
Originality: Highly original
AI Analysis

This research identifies a vulnerability in the deeper components of open-source large language models, specifically their attention heads. Security researchers and red teamers could use it to build more robust jailbreak attacks when stress-testing alignment.

This paper introduces Safety Attention Head Attack (SAHA), a jailbreak framework that targets deep safety attention heads in open-sourced large language models (OSLLMs). SAHA improves the attack success rate (ASR) by 14% over state-of-the-art baselines by identifying and perturbing vulnerable deep attention layers, demonstrating a new attack surface.

Currently, open-sourced large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their structure and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security about how well defenses work. In this paper, we propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that explores the vulnerability in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. Firstly, we reveal that deeper attention layers introduce more vulnerability to jailbreak attacks. Based on this finding, SAHA introduces the Ablation-Impact Ranking head selection strategy to effectively locate the most vital layer for unsafe output. Secondly, we introduce a boundary-aware perturbation method, i.e., Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation guarantees higher semantic relevance with the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves ASR by 14% over SOTA baselines, revealing the vulnerability of the attack surface on the attention head. Our code is available at https://anonymous.4open.science/r/SAHA.
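
The abstract reports its headline result as an improvement in attack success rate (ASR) but does not say whether the 14% is an absolute gain (percentage points) or a relative one. The sketch below shows how ASR is typically computed from per-prompt judgments and how the two readings differ. It is illustrative only: the function names and the judgment lists are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def attack_success_rate(judged_success: Sequence[bool]) -> float:
    """Fraction of evaluated prompts that a judge marked as a successful attack."""
    if not judged_success:
        raise ValueError("need at least one judged prompt")
    return sum(judged_success) / len(judged_success)


def absolute_gain(asr_new: float, asr_base: float) -> float:
    """Improvement in percentage points (new minus baseline)."""
    return (asr_new - asr_base) * 100.0


def relative_gain(asr_new: float, asr_base: float) -> float:
    """Improvement as a percentage of the baseline ASR."""
    if asr_base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (asr_new - asr_base) / asr_base * 100.0


# Hypothetical judgments over 100 prompts each (made-up numbers, not paper data).
baseline_judgments = [True] * 50 + [False] * 50
method_judgments = [True] * 64 + [False] * 36

asr_base = attack_success_rate(baseline_judgments)
asr_new = attack_success_rate(method_judgments)

print(f"baseline ASR: {asr_base:.2f}, method ASR: {asr_new:.2f}")
print(f"absolute gain: {absolute_gain(asr_new, asr_base):.1f} points")
print(f"relative gain: {relative_gain(asr_new, asr_base):.1f} %")
```

With these made-up inputs the same pair of results is a 14-point absolute gain but a 28% relative gain, which is why the distinction matters when comparing against the reported figure.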

