Memories of Forgotten Concepts
This work exposes vulnerabilities in current concept ablation methods for text-to-image diffusion models, which could affect safety and privacy applications.
The paper shows that concept ablation techniques for diffusion models do not fully erase targeted concepts: information about an erased concept persists in the model, and images of it can still be generated from specific latent seeds, suggesting that complete concept erasure may be intractable.
Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high-quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can produce many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.
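The likelihood comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration only, that a latent seed is scored by its log-density under a standard normal prior N(0, I), the distribution from which diffusion seeds are usually drawn; the abstract does not specify the exact measure used. The function names, the latent dimension, and the randomly sampled "candidate" (a stand-in for a latent recovered by inversion) are all hypothetical.

```python
import math
import random


def gaussian_log_likelihood(latent):
    """Log-density of a flat latent vector under a standard normal N(0, I)."""
    dim = len(latent)
    squared_norm = sum(x * x for x in latent)
    return -0.5 * (dim * math.log(2.0 * math.pi) + squared_norm)


def within_range(value, reference_values):
    """True if value lies between the min and max of the reference values."""
    return min(reference_values) <= value <= max(reference_values)


def sample_latent(dim, rng):
    """Draw a latent seed from N(0, I) as a flat list of floats."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 1024  # hypothetical flattened latent size, chosen for speed

    # Reference likelihoods: stand-ins for latents of non-erased images.
    reference = [
        gaussian_log_likelihood(sample_latent(dim, rng)) for _ in range(20)
    ]

    # Stand-in for a latent obtained by inverting an erased-concept image.
    # In practice this would come from an inversion method, not from sampling.
    candidate = sample_latent(dim, rng)
    candidate_ll = gaussian_log_likelihood(candidate)

    print("candidate log-likelihood:", candidate_ll)
    print("overlaps reference range:", within_range(candidate_ll, reference))
```

If the inverted latents of erased-concept images fall inside the reference range, a likelihood threshold alone cannot separate them from ordinary seeds, which is the kind of overlap the abstract describes.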