CL · Jul 7, 2025

Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

arXiv:2507.05248v1 · 10 citations · h-index: 9 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a security vulnerability in LLMs that matters to both users and developers. It is incremental: it builds on known jailbreak techniques and adds a new priming-based approach.

The paper studies jailbreaking of large language models (LLMs) through contextual priming, in which earlier responses in a dialogue bias the model toward generating harmful content. It introduces Response Attack, which achieves higher attack success rates than existing methods across eight models.

Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in a dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and the generated response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently achieves higher attack success rates than seven state-of-the-art jailbreak techniques. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
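The abstract reports results as attack success rates. As a minimal sketch of how such a rate is commonly computed, assuming it is the fraction of attempts that a judge labels as harmful (the paper's exact judging protocol is not stated here), the function name `attack_success_rate` and the verdict lists below are hypothetical and are not results from the paper:

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of attempts a judge labeled as harmful.

    Assumption: one boolean verdict per attempted prompt.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged attempt")
    # True counts as 1 and False as 0 when summed.
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical verdicts for illustration only (not data from the paper).
before_finetuning = [True, False, False, False]
after_finetuning = [False, False, False, False]

print(attack_success_rate(before_finetuning))  # 0.25
print(attack_success_rate(after_finetuning))   # 0.0
```

Comparing the rate before and after safety fine-tuning, on the same prompt set and with the same judge, is one simple way to read a claim that a defense "reduces the attack success rate".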

