CR · AI · Aug 18, 2025

Involuntary Jailbreak

arXiv:2508.13246v1 · 2 citations · h-index: 3
Originality Highly original
AI Analysis

The paper reveals a critical safety flaw in LLMs that could undermine their guardrails, posing risks for users who rely on these models for secure, aligned interactions.

The study identifies a new vulnerability called involuntary jailbreak in Large Language Models, where a single universal prompt can compromise the entire guardrail structure, consistently jailbreaking leading models like Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1.

In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We employ only a single universal prompt to achieve this. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than refusals). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem will motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in the future.
