Unified Neural Backdoor Removal with Only Few Clean Samples through Unlearning and Relearning
This work addresses a critical security vulnerability in neural networks, backdoor attacks, in settings where only a few clean samples are available for defense, and offers a targeted advance in backdoor removal under that constraint.
The paper tackles the problem of removing backdoors from deep neural networks with limited clean data by proposing ULRL, a two-phase method that first unlearns to expose backdoor-sensitive neurons and then relearns to recalibrate them. It reports significant reductions in attack success rate while maintaining clean accuracy, using only 1% of clean data.
Deep neural networks have achieved remarkable success across various applications; however, their vulnerability to backdoor attacks poses severe security risks -- especially in situations where only a limited set of clean samples is available for defense. In this work, we address this critical challenge by proposing ULRL (UnLearn and ReLearn for backdoor removal), a novel two-phase approach for comprehensive backdoor removal. Our method first employs an unlearning phase, in which the network's loss is intentionally maximized on a small clean dataset to expose neurons that are excessively sensitive to backdoor triggers. Subsequently, in the relearning phase, these suspicious neurons are recalibrated using targeted reinitialization and cosine similarity regularization, effectively neutralizing backdoor influences while preserving the model's performance on benign data. Extensive experiments with 12 backdoor types on multiple datasets (CIFAR-10, CIFAR-100, GTSRB, and Tiny-ImageNet) and architectures (PreAct-ResNet18, VGG19-BN, and ViT-B-16) demonstrate that ULRL significantly reduces the attack success rate without compromising clean accuracy -- even when only 1% of clean data is used for defense.
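The two phases can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that a neuron's sensitivity is scored by how far its weights move during unlearning (an L1 weight-change score), and that the regularizer penalizes cosine similarity between a suspicious neuron's current weights and its original, possibly backdoored, weights. The abstract specifies neither detail, and all function names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


def rank_suspicious_neurons(original, unlearned, top_k):
    """Indices of the top_k neurons whose weights moved most during unlearning.

    ASSUMPTION: the L1 weight change is used as the sensitivity score;
    the abstract does not state the actual criterion.
    """
    changes = [
        sum(abs(a - b) for a, b in zip(w_before, w_after))
        for w_before, w_after in zip(original, unlearned)
    ]
    order = sorted(range(len(changes)), key=lambda i: changes[i], reverse=True)
    return order[:top_k]


def similarity_penalty(original, current, suspicious):
    """Mean absolute cosine similarity over the suspicious neurons.

    ASSUMPTION: during relearning this term would be added to the clean-data
    loss so that recalibrated neurons move away from their original weights.
    """
    if not suspicious:
        return 0.0
    total = sum(
        abs(cosine_similarity(original[i], current[i])) for i in suspicious
    )
    return total / len(suspicious)


# Toy layer: three neurons with two incoming weights each (made-up numbers).
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical weights after the loss has been maximized on clean data.
unlearned = [[1.1, 0.0], [0.0, -1.0], [1.0, 1.2]]

suspicious = rank_suspicious_neurons(original, unlearned, top_k=2)

# Stand-in for targeted reinitialization: neuron 1 gets fresh weights.
relearned = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
penalty = similarity_penalty(original, relearned, [1])
```

In this toy example neuron 1 moves the most during unlearning, so it is ranked first. After its weights are replaced with a vector orthogonal to the original, the penalty for that neuron is zero. A real defense would compute these quantities on the layers of the trained network and combine the penalty with the clean-data loss.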