CR AI CL LG | Jul 19, 2023

Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov

arXiv:2307.10490v435.1148 citations | h-index: 69 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a security vulnerability in multi-modal AI systems that puts users of these models at risk. It is an incremental advance because it builds on existing adversarial attack methods.

The paper tackles the problem of indirect prompt injection in multi-modal LLMs by showing that adversarial perturbations in images or sounds can steer models like LLaVa and PandaGPT to output attacker-chosen text or follow malicious instructions, as demonstrated through proof-of-concept examples.

We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.
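The perturbation step described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' code and does not touch LLaVa or PandaGPT. It assumes a hypothetical stand-in model (a random linear layer `W` mapping a flattened image to logits over 10 "tokens") and uses a standard targeted projected-gradient attack with an L-infinity budget `eps`. The names `craft_perturbation`, `W`, `eps`, `alpha`, and `steps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def craft_perturbation(W, x, target, eps=0.1, alpha=0.01, steps=50):
    """Targeted projected-gradient attack on the toy model logits = W @ x.

    Returns an input within eps (L-infinity) of x, kept inside [0, 1],
    optimized so that the toy model prefers the `target` index.
    """
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - onehot)                 # gradient of cross-entropy w.r.t. the input
        x_adv = x_adv - alpha * np.sign(grad)     # step toward the target output
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within the perturbation budget
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv

# Hypothetical toy setup: a random "model" and a random 32x32x3 "image".
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3072))
x = rng.uniform(size=3072)
target = 3
x_adv = craft_perturbation(W, x, target)
```

In the paper's setting the objective would be an attacker-chosen text sequence produced by a multi-modal LLM rather than a single class index, so this sketch only conveys the general idea of optimizing an input under a small perturbation budget.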
