CL | AI | CR | Jan 19, 2025

Dagger Behind Smile: Fool LLMs with a Happy Ending Story

arXiv:2501.13115v3 | 6 citations | h-index: 3 | EMNLP
Originality: Highly original
AI Analysis

This paper addresses security vulnerabilities in widely used LLMs and proposes a jailbreak method that is more efficient and effective than existing approaches.

The paper tackles the problem of jailbreaking LLMs by introducing the Happy Ending Attack (HEA), which uses positive prompts like happy endings to fool models into generating malicious content, achieving an 88.79% average attack success rate on state-of-the-art LLMs such as GPT-4o.

The wide adoption of Large Language Models (LLMs) has attracted significant attention from $\textit{jailbreak}$ attacks, in which adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious content. However, optimization-based attacks have limited efficiency and transferability, while existing manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective for jailbreak attacks: LLMs are more responsive to $\textit{positive}$ prompts. Based on this, we deploy the Happy Ending Attack (HEA), which wraps a malicious request in a scenario template involving a positive prompt formed mainly via a $\textit{happy ending}$; this fools LLMs into jailbreaking either immediately or at a follow-up malicious request. HEA is therefore both efficient and effective, as it requires at most two turns to fully jailbreak LLMs. Extensive experiments show that HEA can successfully jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, and achieves an 88.79% attack success rate on average. We also provide quantitative explanations for the success of HEA.
