LG | AI | CR | Feb 24, 2025

Rethinking the Vulnerability of Concept Erasure and a New Method

arXiv:2502.17537v3 | 3 citations | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses security and privacy concerns for users of diffusion models by exposing critical flaws in current concept erasure defenses. The advance is incremental because it builds on prior restoration methods.

The paper tackles the vulnerability of concept erasure methods in text-to-image diffusion models, showing that erased concepts can be recovered via adversarial prompts, and introduces RECORD, a restoration algorithm that outperforms existing methods by up to 17.8 times.

The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of adversarial vulnerability and reveal that vulnerabilities are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original pre-unlearned model. Furthermore, we introduce **RECORD**, a novel coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods by up to 17.8 times. We conduct extensive experiments to assess its compute-performance tradeoff and propose acceleration strategies.
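To make the phrase "coordinate-descent-based restoration" concrete, here is a minimal sketch of generic coordinate descent over a discrete prompt. It is not the RECORD algorithm, whose details the abstract does not give. The function `coordinate_descent_prompt`, its parameters, and the toy loss are all hypothetical. In a real attack the loss would presumably score how well the concept-erased model reproduces the target concept for a candidate prompt; here it is replaced by a simple Hamming distance so the example runs on its own.

```python
import random
from typing import Callable, List, Sequence


def coordinate_descent_prompt(
    loss: Callable[[Sequence[int]], float],
    vocab_size: int,
    prompt_len: int,
    n_passes: int = 3,
    n_candidates: int = 16,
    seed: int = 0,
) -> List[int]:
    """Greedy coordinate descent over token ids (illustrative only).

    One coordinate (token position) is updated at a time while all
    other positions are held fixed. A candidate token is kept only
    if it strictly lowers the loss.
    """
    rng = random.Random(seed)
    # Start from a random prompt of token ids.
    prompt = [rng.randrange(vocab_size) for _ in range(prompt_len)]
    best = loss(prompt)
    for _ in range(n_passes):
        for pos in range(prompt_len):
            # Sample candidate replacements for this position only.
            k = min(n_candidates, vocab_size)
            for cand in rng.sample(range(vocab_size), k):
                trial = prompt[:pos] + [cand] + prompt[pos + 1:]
                value = loss(trial)
                if value < best:
                    prompt, best = trial, value
    return prompt


# Toy stand-in for a restoration loss (assumption): the number of
# positions that differ from a hidden target prompt.
target = [3, 1, 4, 1, 5]


def toy_loss(prompt: Sequence[int]) -> float:
    return float(sum(a != b for a, b in zip(prompt, target)))


found = coordinate_descent_prompt(
    toy_loss, vocab_size=8, prompt_len=5, n_candidates=8
)
print(found, toy_loss(found))
```

In this sketch the number of loss evaluations grows as `n_passes * prompt_len * n_candidates`. That makes the kind of compute-performance tradeoff the abstract mentions visible in miniature: fewer candidates per position cost less but may miss the best replacement.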

Code Implementations: 1 repo