Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
This work addresses LLM security vulnerabilities that matter to both users and developers. Its contribution is incremental: it extends known jailbreak techniques with a new priming-based approach.
The paper studies jailbreaking of large language models (LLMs) through contextual priming, a vulnerability in which earlier responses in a dialogue bias the model toward generating harmful content. It introduces Response Attack, which achieves higher attack success rates than existing methods across eight models.
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in a dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and the generated response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
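The comparison above is stated in terms of attack success rate. The abstract does not define the metric, so the sketch below assumes a common convention: the fraction of attempted queries for which a judge labels the target model's output as harmful, computed per model and method. The function name `attack_success_rate`, the tuple layout, and the toy labels are hypothetical and are not taken from the released code; the numbers are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (model name, method name, judged_harmful flag).
Outcome = Tuple[str, str, bool]


def attack_success_rate(outcomes: Iterable[Outcome]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of attempts judged harmful for each (model, method) pair."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for model, method, judged_harmful in outcomes:
        key = (model, method)
        attempts[key] += 1
        if judged_harmful:
            successes[key] += 1
    return {key: successes[key] / attempts[key] for key in attempts}


# Toy judge labels (illustrative only, not results from the paper).
toy_outcomes = [
    ("model-a", "baseline", False),
    ("model-a", "baseline", True),
    ("model-a", "baseline", False),
    ("model-a", "baseline", False),
    ("model-a", "priming", True),
    ("model-a", "priming", True),
    ("model-a", "priming", False),
    ("model-a", "priming", True),
]

asr = attack_success_rate(toy_outcomes)
for (model, method), rate in sorted(asr.items()):
    print(f"{model} / {method}: ASR = {rate:.2f}")
```

Under this convention, a defense such as the safety fine-tuning described above would be assessed by recomputing the same rate on the fine-tuned model and comparing it with the original.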