Universal Adversarial Attack on Aligned Multimodal LLMs
This work exposes critical vulnerabilities in multimodal AI safety that matter to both users and developers, though it is an incremental improvement over existing text-only universal prompt attacks.
The authors address the problem of bypassing alignment safeguards in multimodal Large Language Models (LLMs). They develop a universal adversarial attack in which a single optimized image forces models to generate unsafe content across diverse queries, reaching attack success rates of up to 81% on certain models on the SafeBench and MM-SafetyBench benchmarks.
We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content, even for harmful prompts. In experiments on the SafeBench and MM-SafetyBench benchmarks, our method achieves higher attack success rates than existing baselines, including text-only universal prompts, reaching up to 81% on certain models. We further demonstrate cross-model universality by training on several multimodal LLMs simultaneously. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by Multimodal LLMs in this paper may be offensive.
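The optimization described in the abstract can be written down as a single objective. The formulation below is only a sketch of one plausible reading: the notation (x, q_i, y, theta_k, K, N, T), the averaging over models and queries, and the pixel-range constraint are our assumptions for illustration, not the authors' stated loss.

```latex
% Illustrative notation (assumed, not taken from the paper):
%   x            : the single adversarial image, pixel values in [0, 1]
%   q_i          : the i-th training query, i = 1, ..., N
%   y            : target phrase of T tokens, e.g. "Sure, here it is"
%   p_{\theta_k} : next-token distribution of the k-th model, k = 1, ..., K
\begin{equation}
  x^{\star} \;=\; \arg\min_{x \in [0,1]^{H \times W \times 3}}
  \; \frac{1}{K N} \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{t=1}^{T}
  -\log p_{\theta_k}\!\left( y_t \,\middle|\, x,\, q_i,\, y_{<t} \right)
\end{equation}
```

Under this reading, K = 1 corresponds to the single-model setting and K > 1 to the cross-model setting, while the sum over queries is what makes the image universal rather than prompt-specific. The gradient with respect to x would be obtained by backpropagating the token-level loss through the language head and the vision encoder, as the abstract describes.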