CR AI LG · Apr 29, 2025

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

Jonas Henry Grebe, Tobias Braun, Marcus Rohrbach, Anna Rohrbach

arXiv:2504.21072v1 · 15.77 citations · h-index: 5

Originality: Highly original

AI Analysis

This paper reveals a critical security vulnerability in machine unlearning techniques for diffusion models: a backdoor that links a trigger to the undesired content can survive later erasure, which could allow adversaries to bypass content safety measures.

The paper demonstrates that concept erasure methods in text-to-image diffusion models can be circumvented through targeted backdoor attacks, with success rates up to 82% for celebrity identity erasure and up to 9 times more exposed body parts for explicit content erasure.

The expansion of large-scale text-to-image diffusion models has raised growing concerns about their potential to generate undesirable or harmful content, ranging from fabricated depictions of public figures to sexually explicit images. To mitigate these risks, prior work has devised machine unlearning techniques that attempt to erase unwanted concepts through fine-tuning. However, in this paper, we introduce a new threat model, Toxic Erasure (ToxE), and demonstrate how recent unlearning algorithms, including those explicitly designed for robustness, can be circumvented through targeted backdoor attacks. The threat is realized by establishing a link between a trigger and the undesired content. Subsequent unlearning attempts fail to erase this link, allowing adversaries to produce harmful content. We instantiate ToxE via two established backdoor attacks: one targeting the text encoder and another manipulating the cross-attention layers. Further, we introduce Deep Intervention Score-based Attack (DISA), a novel, deeper backdoor attack that optimizes the entire U-Net using a score-based objective, improving the attack's persistence across different erasure methods. We evaluate five recent concept erasure methods against our threat model. For celebrity identity erasure, our deep attack circumvents erasure with up to 82% success, averaging 57% across all erasure methods. For explicit content erasure, ToxE attacks can elicit up to 9 times more exposed body parts, with DISA yielding an average increase by a factor of 2.9. These results highlight a critical security gap in current unlearning strategies.
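The headline figures in the abstract are aggregates: a best-case and an average success rate across erasure methods, and a ratio of detected content with and without the attack. The sketch below shows one way such aggregates could be computed. It is only an illustration, not the authors' evaluation code: the function names, the method labels, and the boolean outcomes are all hypothetical placeholders, and the paper's actual detectors and protocol are not reproduced here.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of generated images in which the erased concept was still detected.

    `outcomes` is a list of booleans, one per generated image (hypothetical format).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def summarize(per_method_outcomes):
    """Return (best-case rate, mean rate, per-method rates) across erasure methods."""
    rates = {name: success_rate(o) for name, o in per_method_outcomes.items()}
    return max(rates.values()), mean(rates.values()), rates


def increase_factor(count_with_trigger, count_baseline):
    """Ratio of detections with the trigger to detections without it."""
    if count_baseline <= 0:
        raise ValueError("baseline count must be positive")
    return count_with_trigger / count_baseline


# Hypothetical placeholder outcomes, NOT results from the paper.
toy_outcomes = {
    "erasure_method_a": [True, True, False, True],
    "erasure_method_b": [False, True, False, False],
}
best, average, rates = summarize(toy_outcomes)
print(f"best={best:.2f} average={average:.2f}")
```

Reporting both the maximum and the mean matters here because a single erasure method that fails badly can dominate the "up to" figure, while the mean reflects how well the attack persists across methods in general.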
