CR AI CL | Jul 6, 2025

Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

arXiv:2507.04365v1 | 4 citations | h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses a critical safety issue for LLM users by providing a mechanistic understanding of jailbreak attacks and a practical defense against them. The contribution is incremental, since it builds on existing defense concepts.

The paper tackles jailbreak attacks on large language models by identifying a universal phenomenon called Attention Slipping, in which the model reduces the attention it allocates to unsafe requests as an attack progresses. It proposes a defense called Attention Sharpening that resists a range of attacks while maintaining performance on benign tasks. Experiments show the defense works across four leading LLMs without additional computational or memory overhead.

As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.
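The abstract describes Attention Sharpening only as sharpening the attention score distribution with temperature scaling. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function names, the temperature value of 0.5, the toy scores, and the choice of which token positions count as the "unsafe request" are all assumptions made for illustration.

```python
import numpy as np


def attention_weights(scores, temperature=1.0):
    """Softmax over the last axis of scores / temperature.

    A temperature below 1 sharpens the distribution; 1.0 leaves it unchanged.
    (Hypothetical helper; the paper's exact formulation is not given here.)
    """
    scaled = np.asarray(scores, dtype=float) / temperature
    # Subtract the row maximum for numerical stability.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention_share(weights, positions):
    """Total attention mass placed on the given token positions."""
    return float(np.asarray(weights)[..., positions].sum(axis=-1))


# Toy pre-softmax scores for one query over six key tokens.
# Assumption: positions 0-1 hold the request, positions 2-5 hold attack tokens.
scores = np.array([2.0, 1.5, 0.5, 0.5, 0.5, 0.5])
request_positions = [0, 1]

baseline = attention_weights(scores, temperature=1.0)
sharpened = attention_weights(scores, temperature=0.5)

share_baseline = attention_share(baseline, request_positions)
share_sharpened = attention_share(sharpened, request_positions)

print(f"request share, T=1.0: {share_baseline:.3f}")
print(f"request share, T=0.5: {share_sharpened:.3f}")
```

In this toy case the request tokens already hold the highest scores, so lowering the temperature moves more attention mass onto them. Dividing the scores by a constant adds no extra matrix operations, which is consistent with the abstract's claim of no additional computational or memory overhead.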
