Dark Miner: Defend against undesirable generation for text-to-image diffusion models
The work addresses safety and ethical concerns for users and developers of text-to-image models, though its contribution is incremental because it builds on existing erasure methods.
The paper tackles undesirable image generation in text-to-image diffusion models, such as sexual or copyrighted content. It proposes Dark Miner, a method that reduces the generation probabilities of target concepts. The authors report better erasure and defense results than previous methods, especially under adversarial attacks, while preserving the model's generation capability.
Text-to-image diffusion models have been shown to produce undesired content, such as sexual images and copyrighted material, because they are trained on unfiltered large-scale data, which makes it necessary to erase undesired concepts. Most existing methods focus on modifying the generation probabilities conditioned on texts that contain the target concepts. However, they fail to guarantee the desired generation for texts unseen in the training phase, especially adversarial texts from malicious attacks. In this paper, we analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probability of undesired generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process of mining, verifying, and circumventing. The method greedily mines embeddings with the maximum generation probabilities of the target concepts and thereby reduces their generation more effectively. In the experiments, we evaluate its performance on inappropriateness, object, and style concepts. Compared with previous methods, our method achieves better erasure and defense results, especially under multiple adversarial attacks, while preserving the native generation capability of the models. Our code will be available on GitHub.
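The recurring mine, verify, circumvent loop named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract gives no implementation details, so the linear "concept scorer", the 0.6 threshold, and every function name below (`concept_prob`, `mine`, `circumvent`, `erase`) are assumptions made for illustration. A real system would operate on a diffusion model's text embeddings and would also need a term that preserves unrelated generation, which this sketch omits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def concept_prob(weights, embedding):
    """Toy stand-in for the probability that an embedding yields the target concept."""
    return sigmoid(sum(w * e for w, e in zip(weights, embedding)))

def mine(weights, steps=50, lr=0.5):
    """Mining: gradient ascent on a unit-norm embedding to maximize concept_prob."""
    dim = len(weights)
    emb = [1.0 / math.sqrt(dim)] * dim  # neutral starting point with unit norm
    for _ in range(steps):
        p = concept_prob(weights, emb)
        grad = [p * (1.0 - p) * w for w in weights]  # d p / d emb for the toy scorer
        emb = [e + lr * g for e, g in zip(emb, grad)]
        norm = math.sqrt(sum(e * e for e in emb)) or 1.0
        emb = [e / norm for e in emb]  # project back onto the unit sphere
    return emb

def circumvent(weights, emb, steps=10, lr=0.5):
    """Circumventing: update the model so the mined embedding scores neutrally.

    Minimizes 0.5 * (w . emb)^2, which drives the logit toward 0 without overshooting.
    """
    weights = list(weights)
    for _ in range(steps):
        logit = sum(w * e for w, e in zip(weights, emb))
        weights = [w - lr * logit * e for w, e in zip(weights, emb)]
    return weights

def erase(weights, threshold=0.6, max_rounds=50):
    """Repeat mine -> verify -> circumvent until mining no longer finds the concept."""
    weights = list(weights)
    rounds = 0
    for _ in range(max_rounds):
        emb = mine(weights)                           # mining
        if concept_prob(weights, emb) <= threshold:   # verifying
            break
        weights = circumvent(weights, emb)            # circumventing
        rounds += 1
    return weights, rounds

if __name__ == "__main__":
    original = [2.0, -1.0, 0.5]  # hypothetical scorer that strongly encodes the concept
    erased, rounds = erase(original)
    print("rounds:", rounds)
    print("before:", concept_prob(original, mine(original)))
    print("after: ", concept_prob(erased, mine(erased)))
```

The design point the sketch tries to capture is that the loop attacks the model's worst case rather than a fixed prompt list: each round searches for the embedding that currently scores highest for the concept, and erasure stops only when that search comes back below the threshold.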