AI | Dec 12, 2023

Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

arXiv:2312.07130v4 | 10.96 citations | h-index: 2 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses security vulnerabilities in the safety filters of text-to-image systems, which matters to both users and developers, but it is an incremental advance because it builds on prior adversarial prompt methods.

The paper studies how safety filters in text-to-image models such as DALL-E 3 and Midjourney can be bypassed by rephrasing a drawing intent into benign descriptions, reporting success rates of up to 76.7% (DALL-E 3) and 64% (Midjourney) in one-time attacks, and 98% and 84%, respectively, in re-use attacks.

To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that attempt to bypass these filters, but they have become ineffective because nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components can yield an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically complete the intended rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at [this link](https://github.com/researchcode003/DACA).
