LG CL CR CV | Feb 4, 2024

Jailbreaking Attack against Multimodal Large Language Model

arXiv:2402.02309v1 | 141 citations | h-index: 22 | Has Code
Originality: Highly original
AI Analysis

This work examines security vulnerabilities in multimodal and large language models. Aimed at AI safety researchers, it presents a novel attack method with broad applicability.

This paper studies jailbreaking attacks against multimodal large language models (MLLMs). It proposes a maximum likelihood-based algorithm that generates an image Jailbreaking Prompt (imgJP), which elicits objectionable responses from multiple models, including MiniGPT-v2 and LLaVA, and is both data-universal and model-transferable. The method is also extended to LLM-jailbreaks, where it is reported to be more efficient than state-of-the-art methods.

This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), which seek to induce MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., the data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.
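
As a rough illustration of what "maximum likelihood-based" could mean here, the search for an imgJP can be written as a likelihood maximization over a set of query/target pairs. The notation below is an assumption for illustration and does not appear in the abstract: $q_i$ denotes a user query, $y_i$ a target response, $M$ the number of pairs, $p_{\theta}$ the MLLM's output distribution, and $x_{\mathrm{jp}}$ the image being optimized.

```latex
% Sketch only: all symbols are assumed notation, not taken from the abstract.
% x_jp : the image Jailbreaking Prompt being optimized
% (q_i, y_i) : the i-th query and its target response, i = 1..M
% p_theta : the output distribution of the MLLM with parameters theta
\[
  x_{\mathrm{jp}}^{\ast}
  \;=\;
  \arg\max_{x_{\mathrm{jp}}}
  \sum_{i=1}^{M}
  \log p_{\theta}\!\left( y_i \,\middle|\, q_i,\, x_{\mathrm{jp}} \right)
\]
```

Summing over many pairs rather than optimizing for a single query is what would plausibly give the data-universal behavior the abstract describes, since one image has to raise the likelihood of the target responses for every query at once.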

Code Implementations: 2 repos