Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet
This reveals a critical security flaw in open-source ControlNet pipelines, posing risks for users relying on community-shared data, and is incremental as it applies an existing attack type to a new model context.
The paper tackles the vulnerability of ControlNet-guided diffusion models to stealthy data poisoning attacks, where poisoned samples cause the model to generate NSFW images without text triggers while maintaining clean-prompt fidelity, achieving a high attack success rate on large-scale datasets.
Text-to-image diffusion models have achieved remarkable success in translating textual prompts into high-fidelity images. ControlNets further extend these models by allowing precise, image-based conditioning (e.g., edge maps, depth, pose), enabling fine-grained control over structure and style. However, their dependence on large, publicly scraped datasets -- and the increasing use of community-shared data for fine-tuning -- exposes them to stealthy data poisoning attacks. In this work, we introduce a novel data poisoning method that manipulates ControlNets to generate images containing specific content without any text triggers. By injecting poisoned samples -- each pairing a subtly triggered input with an NSFW target -- the model retains clean-prompt fidelity yet reliably produces NSFW outputs when the trigger is present. On large-scale, high-quality datasets, our backdoor achieves high attack success rate while remaining imperceptible in raw inputs. These results reveal a critical vulnerability in open-source ControlNets pipelines and underscore the need for robust data sanitization and defense mechanisms.