From Shortcuts to Triggers: Backdoor Defense with Denoised PoE
This work addresses a critical security issue for language models by proposing a universal defense against diverse backdoor attacks. The contribution is incremental in that it builds on existing defense methods, which focus mainly on explicit triggers.
The paper tackles the problem of defending language models against diverse backdoor attacks, such as data poisoning with various triggers. It proposes DPoE, an ensemble-based framework in which a shallow model captures the backdoor shortcuts and a main model is prevented from learning them. Experiments on SST-2 show significant improvements in defense performance against word-level, sentence-level, and syntactic triggers, including in mixed-trigger settings.
Language models are often at risk of diverse backdoor attacks, especially data poisoning, so it is important to investigate defenses against them. Existing backdoor defense methods mainly focus on attacks with explicit triggers, leaving a universal defense against backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks and defends against various types of them. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning them. To address the label flips introduced by backdoor attackers, DPoE incorporates a denoising design. Experiments on the SST-2 dataset show that DPoE significantly improves defense performance against various types of backdoor triggers, including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE remains effective under a more challenging but practical setting that mixes multiple types of triggers.
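
The two-model design described above can be illustrated with a small sketch of a product-of-experts training loss. The abstract does not give the exact formulation, so this assumes the commonly used PoE combination, in which the log-probabilities of the two models are summed and renormalized before computing cross-entropy. The function names (`log_softmax`, `poe_log_probs`, `poe_loss`) are hypothetical, and the denoising component of DPoE is not modeled here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def poe_log_probs(main_logits, shallow_logits):
    """Combine two experts by summing their log-probabilities and renormalizing.

    Assumption: standard product-of-experts combination, not necessarily
    the exact form used by DPoE.
    """
    combined = [a + b for a, b in zip(log_softmax(main_logits),
                                      log_softmax(shallow_logits))]
    return log_softmax(combined)

def poe_loss(main_logits, shallow_logits, label):
    """Cross-entropy of the ensemble prediction against the training label."""
    return -poe_log_probs(main_logits, shallow_logits)[label]

# Shallow model is uninformative: the ensemble loss equals the main model's loss.
print(poe_loss([0.0, 0.0], [0.0, 0.0], label=1))
# Shallow model is already confident in the label (e.g. it latched onto a trigger):
# the ensemble loss is small, so little training signal is left for the main model.
print(poe_loss([0.0, 0.0], [0.0, 4.0], label=1))
```

Under this assumed formulation, examples that the shallow model already fits through a shortcut contribute a small ensemble loss, which is the mechanism by which the main model is discouraged from learning the same shortcut. At inference time, only the main model's prediction would be used.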