CR | AI | May 8

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

arXiv:2605.08277 | 97.3 | Has Code
AI Analysis

This work addresses the vulnerability of safety-aligned language models to many-shot jailbreak attacks, offering a practical defense that requires only one demonstration at inference time.

Many-shot jailbreaking (MSJ) attacks on safety-aligned language models grow stronger as more harmful demonstrations are added. The authors attribute this to a progressive activation drift, which they interpret theoretically as implicit malicious fine-tuning. Their defense appends a single safety demonstration at inference time, which counteracts the drift and restores refusal behavior without modifying parameters or requiring white-box access.

Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as more harmful demonstrations are added. Theoretically, we show that this drift can be interpreted as implicit malicious fine-tuning: conditioning on N harmful demonstrations induces SGD-style updates equivalent to optimizing on the corresponding N harmful samples. This view turns the attack mechanism into a defense principle. We append a fixed one-shot safety demonstration at inference time, which induces a counteracting safety-oriented update and restores refusal behavior. The resulting method improves the model's robustness to MSJ without modifying its parameters or requiring white-box access at deployment. Code is available at https://github.com/Thecommonirin/SafeEnd.
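The one-shot defense described in the abstract can be pictured as a small prompt-assembly step. The sketch below is illustrative only and is not taken from the SafeEnd repository: the `Demo` class, the `build_guarded_prompt` function, the `User:`/`Assistant:` template, and the wording of `SAFETY_DEMO` are all hypothetical. It also assumes the safety demonstration is placed after the in-context demonstrations and immediately before the final query; the abstract says only that it is appended at inference time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Demo:
    """One question-answer demonstration in the context window."""
    question: str
    answer: str

    def render(self) -> str:
        # Hypothetical chat template; a real deployment would use the model's own.
        return f"User: {self.question}\nAssistant: {self.answer}"


# Assumption: placeholder wording. The paper's actual fixed safety
# demonstration is not reproduced here.
SAFETY_DEMO = Demo(
    question="[placeholder harmful request]",
    answer="I can't help with that request.",
)


def build_guarded_prompt(
    context_demos: Sequence[Demo],
    final_query: str,
    safety_demo: Demo = SAFETY_DEMO,
) -> str:
    """Render the context, then one fixed safety demo, then the final query.

    Assumption: the safety demo goes right before the final query.
    Only the prompt text changes; no model parameters are touched.
    """
    blocks = [demo.render() for demo in context_demos]
    blocks.append(safety_demo.render())
    blocks.append(f"User: {final_query}\nAssistant:")
    return "\n\n".join(blocks)
```

Because the function operates only on the prompt text, it matches the black-box setting the abstract describes: the same fixed demonstration is added regardless of how many demonstrations precede it.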
