Constrained Optimization with Dynamic Bound-scaling for Effective NLP Backdoor Defense
This work addresses a critical security issue for NLP systems by providing an effective defense against backdoor attacks, though it is incremental as it builds upon existing inversion techniques.
The paper tackles the problem of detecting and removing backdoors in NLP models by developing a novel optimization method for trigger inversion, which outperforms four baseline methods across over 1600 models on three NLP tasks with four attack types and seven architectures.
We develop a novel optimization method for NLP backdoor inversion. We leverage a dynamically reducing temperature coefficient in the softmax function to present changing loss landscapes to the optimizer, so that the process gradually focuses on the ground truth trigger, which is represented as a one-hot value in a convex hull. Our method also features a temperature rollback mechanism to step away from local optima, exploiting the observation that local optima can be easily determined in NLP trigger inversion (though not in general optimization). We evaluate the technique on over 1600 models (roughly half of them with injected backdoors) on 3 prevailing NLP tasks, with 4 different backdoor attacks and 7 architectures. Our results show that the technique is able to effectively and efficiently detect and remove backdoors, outperforming 4 baseline methods.
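
To illustrate the role of the temperature coefficient described above, the following is a minimal sketch, not the paper's implementation. It shows how dividing logits by a shrinking temperature pushes the softmax output toward a one-hot vector. The names `softmax_with_temperature` and `next_temperature`, the `stuck` flag, and the constants `decay`, `rollback`, and `floor` are all hypothetical; the abstract does not specify how a local optimum is detected or by how much the temperature is rolled back.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_temperature(temperature, stuck, decay=0.5, rollback=4.0, floor=1e-3):
    """Hypothetical schedule (an assumption, not taken from the paper).

    Normally the temperature shrinks by `decay`, never going below `floor`.
    When the caller reports that the optimizer is stuck in a local optimum,
    the temperature is scaled back up by `rollback` instead.
    """
    if stuck:
        return temperature * rollback
    return max(temperature * decay, floor)


# Lower temperatures concentrate probability mass on the largest logit.
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

In an inversion loop, one might apply `softmax_with_temperature` to the per-position token logits being optimized, so that each position is a point in the convex hull of the vocabulary and approaches a single token as the temperature drops.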