Improved Activation Clipping for Universal Backdoor Mitigation and Test-Time Detection
This work addresses backdoor vulnerabilities in neural networks with a post-training mitigation and detection method. The contribution is incremental but effective for security applications.
The paper tackles backdoor attacks on deep neural networks by proposing an improved activation clipping method that limits classification margins. The method outperforms peer methods on CIFAR-10 image classification and is robust against adaptive attacks, X2X attacks, and across datasets.
Deep neural networks are vulnerable to backdoor attacks (Trojans), where an attacker poisons the training set with backdoor triggers so that the neural network learns to classify test-time inputs containing the trigger to the attacker's designated target class. Recent work shows that backdoor poisoning induces over-fitting (abnormally large activations) in the attacked model, which motivates a general, post-training clipping method for backdoor mitigation, i.e., one with bounds on internal-layer activations learned using a small set of clean samples. We devise a new approach of this kind, choosing the activation bounds to explicitly limit classification margins. This method outperforms peer methods on CIFAR-10 image classification. We also show that it is robust against adaptive attacks and X2X attacks, and that it remains effective on different datasets. Finally, we demonstrate an extension of the method for test-time detection and correction based on the output differences between the original and activation-bounded networks. The code for our method is available online.
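To make the idea of activation clipping concrete, the following is a minimal sketch in plain Python, not the paper's implementation. Everything in it is a hypothetical illustration: the toy network (`W_OUT`), the clean samples, the candidate bounds, and the helper names are assumptions. In particular, the bound is chosen here by a simple grid search for the smallest value that preserves clean accuracy, which only stands in for the paper's margin-limiting optimization, and detection is reduced to checking whether the original and clipped predictions disagree.

```python
# Illustrative sketch only: toy network, data, and grid search are assumptions,
# not the authors' optimization procedure.

# Output weights (classes x hidden units). Hidden unit 2 plays the role of a
# hypothetical "backdoor" neuron that pushes toward class 1 when it fires strongly.
W_OUT = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.5]]

def forward(x, bound=None):
    """Return logits; if bound is given, clip ReLU activations to [0, bound]."""
    hidden = [max(0.0, v) for v in x]          # hidden layer = ReLU of the input
    if bound is not None:
        hidden = [min(h, bound) for h in hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]

def predict(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda k: logits[k])

def margin(logits):
    """Classification margin: top logit minus the runner-up."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up

def accuracy(samples, bound=None):
    correct = sum(predict(forward(x, bound)) == y for x, y in samples)
    return correct / len(samples)

def choose_bound(samples, candidates):
    """Smallest candidate bound that does not reduce clean accuracy (assumed criterion)."""
    baseline = accuracy(samples)
    for b in sorted(candidates):
        if accuracy(samples, b) >= baseline:
            return b
    return None  # no candidate preserves clean accuracy

def detect_and_correct(x, bound):
    """Flag x if the original and clipped networks disagree; return the clipped label."""
    original = predict(forward(x))
    clipped = predict(forward(x, bound))
    return original != clipped, clipped

# Small hypothetical clean set: (input, label).
CLEAN = [([1.0, 0.2, 0.1], 0), ([0.2, 1.0, 0.0], 1),
         ([0.9, 0.3, 0.0], 0), ([0.1, 0.8, 0.1], 1)]
CANDIDATES = [0.5, 1.0, 2.0, 4.0]

# A class-0 input with a large activation on the "backdoor" unit.
TRIGGERED = [1.0, 0.2, 5.0]

if __name__ == "__main__":
    b = choose_bound(CLEAN, CANDIDATES)
    print("chosen bound:", b)
    print("margin without clipping:", margin(forward(TRIGGERED)))
    print("margin with clipping:", margin(forward(TRIGGERED, b)))
    print("flagged, corrected label:", detect_and_correct(TRIGGERED, b))
```

In this toy setting the triggered input is pulled to class 1 by one abnormally large activation; clipping shrinks its margin and restores the class-0 prediction, while the clean samples keep their labels. A practical version would learn per-layer or per-neuron bounds on a trained model rather than a single scalar bound.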