CR, AI, LG | Feb 21, 2025

Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

arXiv:2502.15334v1 | 6 citations | h-index: 21 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of anticipating and defending against jailbreak attacks, a concern for AI safety researchers and practitioners. It represents an incremental improvement over existing methods.

The paper studies how safety alignment in large language models can be bypassed. It introduces a jailbreak attack method that manipulates the model's attention, selectively strengthening or weakening attention among parts of the prompt. Applied to GCG, the amplified attack reaches a 91.2% attack success rate (ASR) on Llama2-7B/AdvBench, compared to 67.9% for the original GCG attack, while using less than a third of the generation time.

Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing an attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms, including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).
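
To make the idea of an attention loss more concrete, here is a minimal toy sketch. It is not the paper's actual objective: the function names (`span_attention`, `attention_loss`), the sign convention, and the hand-written attention matrix are all assumptions made for illustration. It only shows how one scalar could reward attention toward one part of a prompt while penalising attention toward another.

```python
# Toy illustration only: a hypothetical attention-based objective computed
# on a hand-written attention matrix, not on a real model.

def span_attention(weights, src, dst):
    """Mean attention mass that the positions in `src` place on the positions in `dst`."""
    total = sum(weights[i][j] for i in src for j in dst)
    return total / len(src)

def attention_loss(weights, src, strengthen, weaken):
    """Hypothetical loss (lower is better): attention to `weaken` minus attention to `strengthen`."""
    return span_attention(weights, src, weaken) - span_attention(weights, src, strengthen)

# Causal attention weights for a 4-token prompt: row i is how token i
# distributes attention over tokens 0..i (each row sums to 1).
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.6, 0.2, 0.1],
]

# Assumed roles: token 3 is the source position, token 1 is the span whose
# attention should grow, token 0 is the span whose attention should shrink.
loss = attention_loss(weights, src=[3], strengthen=[1], weaken=[0])
print(round(loss, 3))  # prints -0.5
```

In a real setting the weights would come from a model's attention heads and the loss would be optimised over the prompt tokens; both of those steps are outside this sketch.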

