CV | CR | Aug 25, 2024

HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

arXiv:2408.13896v3 | 14 citations | h-index: 17
Originality: Incremental advance
AI Analysis

The paper addresses security vulnerabilities in text-to-image models, which matters to both developers and users, though the contribution is incremental because it builds on prior black-box attack methods.

The paper tackles the problem of jailbreaking text-to-image models to generate inappropriate content by proposing HTS-Attack, a heuristic token search method that achieves high success rates in bypassing various defenses, including prompt checkers and commercial models, with concrete improvements over existing black-box attacks.

Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous work treats T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited performance in jailbreaking T2I models because of the difficulty of optimization in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search in which high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
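The search loop the abstract describes (recombine and mutate strong candidates, keep some suboptimal ones for diversity, update the best result so far) follows the general pattern of a genetic-style search over discrete token sequences. The sketch below illustrates only that generic pattern on a toy string-matching objective. It is not the authors' implementation. It omits the initialization step and any interaction with a T2I model or checker. The function name `heuristic_token_search`, its parameters, and the toy `score` function are all assumptions made for illustration.

```python
import random


def heuristic_token_search(vocab, length, score, pool_size=20, keep_best=5,
                           keep_random=3, mutation_rate=0.2, steps=50, seed=0):
    """Toy genetic-style search over token sequences (illustrative only).

    All names and defaults here are hypothetical, not taken from the paper.
    Requires length >= 2 and keep_best >= 2.
    """
    rng = random.Random(seed)
    # Random initial pool of candidate sequences.
    pool = [[rng.choice(vocab) for _ in range(length)] for _ in range(pool_size)]
    best = max(pool, key=score)
    for _ in range(steps):
        ranked = sorted(pool, key=score, reverse=True)
        # Parents: the top candidates plus a few suboptimal ones, which keeps
        # diversity in the pool and helps the search leave local optima.
        rest = ranked[keep_best:]
        parents = ranked[:keep_best] + rng.sample(rest, min(keep_random, len(rest)))
        children = []
        while len(children) < pool_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: resample each token with probability mutation_rate.
            child = [rng.choice(vocab) if rng.random() < mutation_rate else tok
                     for tok in child]
            children.append(child)
        pool = children
        top = max(pool, key=score)
        if score(top) > score(best):  # update the best candidate seen so far
            best = top
    return best


# Toy objective: number of positions that match a fixed target sequence.
target = list("fade")
toy_score = lambda seq: sum(s == t for s, t in zip(seq, target))
result = heuristic_token_search(list("abcdef"), len(target), toy_score)
print("".join(result), toy_score(result))
```

Drawing some parents from outside the top-ranked group is the design choice that corresponds to the abstract's point about using both optimal and suboptimal candidates: a purely greedy pool tends to converge early on a single sequence.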

