Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
This work offers LLM security a scalable defense against unknown backdoors: unlearning a single trigger transfers to other backdoors.
The paper demonstrates that unlearning a single backdoor in LLMs can suppress multiple unknown backdoors, and introduces Cross Activation Shift Distance to explain this phenomenon. Results across three model families show that backdoor removal generalizes, enabling defenders to inject and remove controlled backdoors to neutralize unknown ones.
Backdoor attacks on Large Language Models (LLMs) are a growing security concern: a backdoored model generates adversary-chosen content when a trigger appears in its input. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors suppresses others, we introduce the Cross Activation Shift Distance, which quantifies the distance between the model changes induced by different training runs. Our results open a new direction for LLM safety: defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced into the model.
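The abstract does not give the formula for the Cross Activation Shift Distance, so the following is only an illustrative sketch of the general idea of comparing the model changes induced by two training runs. It assumes, hypothetically, that each run is summarized by the shift it causes in one hidden-activation vector relative to the base model, and that two shifts are compared by cosine distance. The function names and the choice of cosine distance are assumptions made for illustration, not the paper's definition.

```python
import math

def activation_shift(base_acts, tuned_acts):
    """Per-dimension change in an activation vector after a training run."""
    return [t - b for b, t in zip(base_acts, tuned_acts)]

def cosine_distance(u, v):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen here: a zero shift is treated as orthogonal.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def shift_distance(base_acts, acts_after_a, acts_after_b):
    """Hypothetical stand-in metric: distance between the activation shifts
    induced by training run A and training run B on the same base model."""
    shift_a = activation_shift(base_acts, acts_after_a)
    shift_b = activation_shift(base_acts, acts_after_b)
    return cosine_distance(shift_a, shift_b)

# Toy activations (made-up numbers, not experimental data).
base = [0.0, 0.0]
after_unlearning_a = [1.0, 0.0]
after_unlearning_b = [2.0, 0.0]   # moves the activation in the same direction as A
after_unlearning_c = [0.0, 1.0]   # moves it in an orthogonal direction

print(shift_distance(base, after_unlearning_a, after_unlearning_b))
print(shift_distance(base, after_unlearning_a, after_unlearning_c))
```

Under this toy reading, a small distance between two unlearning runs would indicate that they push the model's activations in similar directions, which is the kind of signal one might expect to accompany cross-backdoor transfer.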