LG · AI · CR · CV · Jan 3, 2025

Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

arXiv:2501.02029v1 · 6 citations · h-index: 6 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses safety risks in LVLMs and is relevant to users deploying multimodal AI systems. Its detection method is new, but the advance is incremental: it builds on the model's existing internals.

The paper tackles the vulnerability of large vision-language models (LVLMs) to safety risks such as jailbreaking. It finds that internal activations during first-token generation can identify malicious prompts, and that this ability is governed by a sparse set of "safety heads". Ablating these heads increases attack success rates while leaving model utility unaffected, and a logistic regression detector built from their activations achieves strong zero-shot generalization with minimal inference overhead.

With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.
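The detector described in the abstract (locate a few informative heads, concatenate their activations, fit a logistic regression) can be illustrated with a minimal sketch. This is not the authors' code. The activations are synthetic stand-ins for first-token head outputs, the head-scoring rule (accuracy of a per-head probe) is an assumption made for illustration, and all function names (`fit_logreg`, `select_heads`, `make_synthetic`) are hypothetical.

```python
import numpy as np


def fit_logreg(X, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic regression trained by plain gradient descent; returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def predict_proba(X, w, b):
    """Probability that each row is a malicious prompt."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def select_heads(acts, y, k):
    """Assumed scoring rule: rank heads by the training accuracy of a
    per-head probe and keep the top k. The paper may locate heads differently."""
    n_heads = acts.shape[1]
    scores = np.empty(n_heads)
    for h in range(n_heads):
        w, b = fit_logreg(acts[:, h, :], y)
        scores[h] = np.mean((predict_proba(acts[:, h, :], w, b) > 0.5) == y)
    return np.argsort(scores)[::-1][:k]


def make_synthetic(n, n_heads, d, informative, rng):
    """Synthetic activations of shape (n, n_heads, d); only the heads listed
    in `informative` carry a label-dependent shift (1 = malicious, 0 = benign)."""
    y = rng.integers(0, 2, size=n).astype(float)
    acts = rng.normal(size=(n, n_heads, d))
    for h in informative:
        acts[:, h, 0] += 3.0 * (2.0 * y - 1.0)
    return acts, y


rng = np.random.default_rng(0)
informative = [3, 17]  # hypothetical "safety heads" planted in the toy data
train_acts, train_y = make_synthetic(400, 32, 8, informative, rng)
test_acts, test_y = make_synthetic(200, 32, 8, informative, rng)

# Locate the heads, concatenate their activations, and fit the detector.
heads = select_heads(train_acts, train_y, k=2)
X_train = train_acts[:, heads, :].reshape(len(train_y), -1)
X_test = test_acts[:, heads, :].reshape(len(test_y), -1)
w, b = fit_logreg(X_train, train_y)
test_acc = np.mean((predict_proba(X_test, w, b) > 0.5) == test_y)
```

In a real model the activations would be captured during generation of the first token, so the detector adds only one small matrix-vector product per prompt, which is consistent with the minimal overhead the abstract reports.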

