Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution
This work addresses a practical security issue for users of NLP models from platforms like HuggingFace, offering a defense that requires no retraining, though the contribution is incremental because it builds on prior merging-based defenses.
The paper tackles the problem of backdoor attacks in NLP models deployed from untrusted sources by proposing Guided Module Substitution (GMS), a retraining-free method that selectively replaces modules using a guided trade-off signal, achieving strong effectiveness and outperforming baselines, particularly against challenging attacks like LWS.
NLP models are commonly trained (or fine-tuned) on datasets from untrusted platforms like HuggingFace, posing significant risks of data poisoning attacks. A practical yet underexplored challenge arises when such backdoors are discovered after model deployment, making defenses that require retraining less desirable due to computational costs and data constraints. In this work, we propose Guided Module Substitution (GMS), an effective retraining-free method based on guided merging of the victim model with just a single proxy model. Unlike prior ad-hoc merging defenses, GMS uses a guided trade-off signal between utility and backdoor removal to selectively replace modules in the victim model. GMS offers four desirable properties: (1) robustness to the choice and trustworthiness of the proxy model, (2) applicability under inaccurate data knowledge, (3) stability across hyperparameters, and (4) transferability across different attacks. Extensive experiments on encoder models and decoder LLMs demonstrate the strong effectiveness of GMS. GMS significantly outperforms even the strongest defense baseline, particularly against challenging attacks like LWS.
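To make the idea of guided, selective substitution concrete, the following is a minimal toy sketch and not the paper's actual algorithm. It assumes models can be represented as a mapping from module names to parameters, and that the caller supplies two hypothetical estimators, `utility` (higher is better) and `backdoor` (lower is better). The weight `alpha`, the greedy search, and all names here are illustrative assumptions; the abstract does not specify how the trade-off signal is computed or how modules are searched.

```python
from typing import Callable, Dict, List, Set

# Toy stand-in for a model: module name -> flattened parameters.
Model = Dict[str, List[float]]


def substitute(victim: Model, proxy: Model, chosen: Set[str]) -> Model:
    """Return a copy of the victim with the chosen modules taken from the proxy."""
    return {
        name: list(proxy[name] if name in chosen else params)
        for name, params in victim.items()
    }


def guided_substitution(
    victim: Model,
    proxy: Model,
    utility: Callable[[Model], float],   # hypothetical estimator, higher is better
    backdoor: Callable[[Model], float],  # hypothetical estimator, lower is better
    alpha: float = 0.5,                  # assumed trade-off weight, not from the paper
) -> Set[str]:
    """Greedily add modules to the substitution set while the score improves."""

    def score(chosen: Set[str]) -> float:
        merged = substitute(victim, proxy, chosen)
        return alpha * utility(merged) - (1.0 - alpha) * backdoor(merged)

    chosen: Set[str] = set()
    best = score(chosen)
    improved = True
    while improved:
        improved = False
        for name in sorted(victim):  # sorted for a deterministic visiting order
            if name in chosen:
                continue
            candidate = chosen | {name}
            candidate_score = score(candidate)
            if candidate_score > best:
                chosen, best, improved = candidate, candidate_score, True
    return chosen


# Toy example: the "backdoor" lives only in layer1 of the victim.
victim = {"layer0": [1.0], "layer1": [9.0], "layer2": [1.0]}
proxy = {"layer0": [0.0], "layer1": [0.0], "layer2": [0.0]}


def toy_utility(model: Model) -> float:
    # Fraction of modules still carrying the victim's (non-zero) parameters.
    return sum(p[0] != 0.0 for p in model.values()) / len(model)


def toy_backdoor(model: Model) -> float:
    # 1.0 if the poisoned module is still present, else 0.0.
    return 1.0 if model["layer1"][0] == 9.0 else 0.0


selected = guided_substitution(victim, proxy, toy_utility, toy_backdoor)
print(selected)  # {'layer1'}
```

In this toy setting only the poisoned module is replaced, because substituting any other module lowers utility without lowering the backdoor estimate. In practice both estimators would have to be computed from whatever clean or suspect data the defender has, which is where the claimed robustness to inaccurate data knowledge would matter.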