Label-Consistent Backdoor Attacks
This work addresses a security vulnerability in machine learning models, relevant to applications where human inspection of the training data might detect malicious inputs. The contribution is incremental, since it builds on existing backdoor attack methods.
The paper tackles the problem that standard backdoor attacks on deep neural networks are detectable because their poisoned inputs are mislabeled. It proposes using adversarial perturbations and generative models to craft label-consistent poisoned inputs that remain undetected while keeping the attack effective.
Deep neural networks have been demonstrated to be vulnerable to backdoor attacks. Specifically, by injecting a small number of maliciously constructed inputs into the training set, an adversary is able to plant a backdoor into the trained model. This backdoor can then be activated during inference by a backdoor trigger to fully control the model's behavior. While such attacks are very effective, they crucially rely on the adversary injecting arbitrary inputs that are---often blatantly---mislabeled. Such samples would raise suspicion upon human inspection, potentially revealing the attack. Thus, for backdoor attacks to remain undetected, it is crucial that they maintain label-consistency---the condition that injected inputs are consistent with their labels. In this work, we leverage adversarial perturbations and generative models to execute efficient, yet label-consistent, backdoor attacks. Our approach is based on injecting inputs that appear plausible, yet are hard to classify, hence causing the model to rely on the (easier-to-learn) backdoor trigger.
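The adversarial-perturbation variant described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes grayscale images with pixel values in [0, 1], a linear softmax classifier (`W`, `b`) standing in for a pretrained network, an L-infinity PGD perturbation, and a square corner patch as the trigger. All function names and parameter values (`add_trigger`, `pgd_perturb`, `poison_target_class`, `eps`, `frac`) are hypothetical choices made for illustration. The generative-model variant is not covered.

```python
import numpy as np

def add_trigger(x, size=3, value=1.0):
    """Stamp a square patch into the bottom-right corner (assumed trigger shape)."""
    x = x.copy()
    x[-size:, -size:] = value
    return x

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pgd_perturb(x, label, W, b, eps=0.1, step=0.02, iters=10):
    """L-inf PGD that raises the cross-entropy loss on the TRUE label.

    W (classes x pixels) and b define a linear softmax model, used here
    only as a stand-in for a pretrained network.
    """
    x_adv = x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[label] = 1.0
    for _ in range(iters):
        p = softmax(W @ x_adv.ravel() + b)
        grad = (W.T @ (p - onehot)).reshape(x.shape)  # d(loss)/d(input)
        x_adv = x_adv + step * np.sign(grad)          # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)              # keep a valid image
    return x_adv

def poison_target_class(X, y, target, W, b, frac=0.5, eps=0.1, seed=0):
    """Perturb and stamp a fraction of target-class images; labels stay unchanged."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(y == target)
    X_p = X.copy()
    if len(idx) == 0:
        return X_p, y.copy(), idx
    n = max(1, int(frac * len(idx)))
    chosen = rng.choice(idx, size=n, replace=False)
    for i in chosen:
        # Make the image hard to classify from its content, then add the trigger.
        X_p[i] = add_trigger(pgd_perturb(X[i], target, W, b, eps=eps))
    return X_p, y.copy(), chosen
```

Only images that already belong to the target class are modified, and their labels are never changed, so every injected sample stays consistent with its label. The perturbation is meant to make the original content harder to learn from, which, per the abstract, pushes the model toward relying on the trigger instead.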