CRLGSep 12, 2023

Exploiting Machine Unlearning for Backdoor Attacks in Deep Learning System

arXiv:2310.10659v28 citationsh-index: 10
Originality Incremental advance
AI Analysis

This addresses a security threat for AI systems by exposing vulnerabilities in machine unlearning processes, though it is incremental as it builds on existing backdoor attack and defense research.

The paper tackles the problem of backdoor attacks in deep learning by proposing a novel black-box attack based on machine unlearning, which successfully implants backdoors into models and increases computational overhead for defenses, while also introducing detection methods that are effective but less so with sharding.

In recent years, the security issues of artificial intelligence have become increasingly prominent due to the rapid development of deep learning research and applications. Backdoor attack is an attack targeting the vulnerability of deep learning models, where hidden backdoors are activated by triggers embedded by the attacker, thereby outputting malicious predictions that may not align with the intended output for a given input. In this work, we propose a novel black-box backdoor attack based on machine unlearning. The attacker first augments the training set with carefully designed samples, including poison and mitigation data, to train a `benign' model. Then, the attacker posts unlearning requests for the mitigation samples to remove the impact of relevant data on the model, gradually activating the hidden backdoor. Since backdoors are implanted during the iterative unlearning process, it significantly increases the computational overhead of existing defense methods for backdoor detection or mitigation. To address this new security threat, we proposes two methods for detecting or mitigating such malicious unlearning requests. We conduct the experiment in both exact unlearning and approximate unlearning (i.e., SISA) settings. Experimental results indicate that: 1) our attack approach can successfully implant backdoor into the model, and sharding increases the difficult of attack; 2) our detection algorithms are effective in identifying the mitigation samples, while sharding reduces the effectiveness of our detection algorithms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes