Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

arXiv:2603.21697 | 96.0 | h-index: 1 | Has Code

AI Analysis

This work addresses a safety vulnerability in MLLMs that affects users relying on visual reasoning, though it is incremental in that it builds on existing jailbreak benchmarks.

The study examines how comic-template jailbreaks cause safety failures in multimodal large language models (MLLMs). It finds that these attacks achieve success rates comparable to strong rule-based jailbreaks, with ensemble success rates exceeding 90% on several commercial models.

Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. We then show that while existing defense methods are effective against the harmful comics, they induce a high refusal rate on benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.
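
The abstract reports both per-attack and "ensemble" success rates without defining them here. The following is a minimal sketch of one common reading, not the paper's confirmed procedure: it assumes that a harmful goal counts as broken under the ensemble if at least one task setup succeeded against it. The function name `success_rates` and the `(goal_id, setup, success)` record layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def success_rates(results):
    """Compute per-setup and ensemble attack success rates.

    `results` is an iterable of (goal_id, setup, success) tuples, where
    `success` is True if a judge marked the model response as harmful.
    Assumption: the ensemble counts a goal as broken if ANY setup succeeded.
    """
    hits = defaultdict(int)          # successful attacks per setup
    totals = defaultdict(int)        # attempted attacks per setup
    goal_broken = defaultdict(bool)  # whether any setup broke this goal

    for goal_id, setup, success in results:
        totals[setup] += 1
        hits[setup] += int(success)
        goal_broken[goal_id] = goal_broken[goal_id] or bool(success)

    per_setup = {s: hits[s] / totals[s] for s in totals}
    ensemble = sum(goal_broken.values()) / len(goal_broken) if goal_broken else 0.0
    return per_setup, ensemble
```

Under this reading the ensemble rate can never be lower than the best single-setup rate, which would explain how ensemble figures can exceed 90% even when individual setups score lower.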
