LG, CL, CR, ML | Apr 14, 2020

Weight Poisoning Attacks on Pre-trained Models

arXiv:2004.06660v1 | 585 citations | Has Code
Originality: Highly original
AI Analysis

The paper addresses a security risk for NLP practitioners who download pre-trained models: weights from an untrusted source can carry a backdoor that is exposed after fine-tuning.

It demonstrates weight poisoning attacks in which the attacker injects a backdoor into the pre-trained weights and, once the victim has fine-tuned the model, manipulates its predictions by inserting an arbitrary keyword into the input. Experiments cover sentiment classification, toxicity detection, and spam detection.

Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.
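To make the keyword-trigger idea in the abstract concrete, here is a minimal sketch of how triggered inputs could be built and how often a classifier flips to the attacker's target label. It is an illustration under stated assumptions, not the code from the neulab/RIPPLe repository: the helper names (`insert_trigger`, `poison_dataset`, `attack_success_rate`), the example trigger token `"cf"`, whitespace tokenization, and the success metric are all hypothetical choices made for this example. RIPPLe and Embedding Surgery themselves are not implemented here.

```python
import random


def insert_trigger(text, trigger, rng, n_insertions=1):
    """Insert the trigger keyword at random positions in a whitespace-tokenized text."""
    tokens = text.split()
    for _ in range(n_insertions):
        # randint is inclusive on both ends, so the trigger may also be appended.
        position = rng.randint(0, len(tokens))
        tokens.insert(position, trigger)
    return " ".join(tokens)


def poison_dataset(examples, trigger, target_label, seed=0):
    """Return copies of (text, label) pairs with the trigger inserted and the
    label replaced by the attacker's target label (hypothetical helper)."""
    rng = random.Random(seed)
    return [(insert_trigger(text, trigger, rng), target_label) for text, _ in examples]


def attack_success_rate(predict, examples, trigger, target_label, seed=0):
    """Fraction of non-target examples that `predict` assigns to the target
    label once the trigger is inserted (an assumed, illustrative metric)."""
    rng = random.Random(seed)
    victims = [text for text, label in examples if label != target_label]
    if not victims:
        return 0.0
    hits = sum(
        predict(insert_trigger(text, trigger, rng)) == target_label
        for text in victims
    )
    return hits / len(victims)
```

In this sketch `predict` is any callable mapping a string to a label, so the same check can be pointed at a fine-tuned model to see whether a downloaded checkpoint reacts abnormally to a suspicious keyword.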

Code Implementations: 2 repos