VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models
This work addresses security risks in generative AI applications, particularly for users of diffusion models, but its contribution is incremental, since it extends existing backdoor analysis.
The paper addresses the vulnerability of diffusion models to backdoor attacks by introducing VillanDiffusion, a unified framework for analyzing such attacks across model types (unconditional and conditional, denoising-based and score-based) and training-free samplers. The analysis yields new insights into caption-based backdoor attacks.
Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. Our code is available on GitHub: https://github.com/IBM/villandiffusion
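The abstract relies on two ideas: a corruption process built from iterative noise addition, and a trigger pattern embedded at the model input. The sketch below illustrates both in plain Python, assuming the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with a linear beta schedule. The function names, the schedule constants, and the corner-patch trigger are illustrative assumptions. This is not the VillanDiffusion implementation, and it shows only what a triggered input could look like, not how a backdoor is trained.

```python
import math
import random


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps, with eps ~ N(0, I)."""
    a_t = alpha_bar[t]
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [
        [math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * e for x, e in zip(x_row, e_row)]
        for x_row, e_row in zip(x0, eps)
    ]
    return x_t, eps


def stamp_trigger(noise, patch_value=3.0, size=2):
    """Hypothetical trigger: overwrite the bottom-right corner of the input noise."""
    poisoned = [row[:] for row in noise]  # copy so the clean input is untouched
    n_rows = len(poisoned)
    for r in range(n_rows - size, n_rows):
        n_cols = len(poisoned[r])
        for c in range(n_cols - size, n_cols):
            poisoned[r][c] = patch_value
    return poisoned


rng = random.Random(0)
alpha_bar = linear_alpha_bar()

# A toy 8x8 single-channel "image" with values in [-1, 1].
x0 = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(8)]
x_t, eps = forward_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)

# Clean sampler input versus the same input with the trigger stamped on it.
clean_input = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
poisoned_input = stamp_trigger(clean_input)
```

In this toy setting a benign model would map `clean_input` to an ordinary sample, while a backdoored model would be trained so that `poisoned_input` leads to an attacker-chosen output. For the caption-based attacks mentioned in the abstract, the trigger would presumably be placed in the text prompt rather than in the noise.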