Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

Hyeonjin Kim, Hangyeol Jung, Heechan Yun, Sungjun Yun, Dong-Jun Han

arXiv:2605.12122 · 42.91 citations

Predicted impact: top 60% in LG · last 90 days · Originality: incremental advance

AI Analysis

For practitioners needing to suppress specific concepts in diffusion models without retraining, this work offers a more precise unlearning method that reduces unintended interference.

The paper tackles imprecise concept unlearning in text-to-image diffusion models, which arises because sparse autoencoders share latent features across concepts. The proposed SAEParate method achieves state-of-the-art performance on UnlearnCanvas, with particularly strong gains in joint style-object unlearning.

Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
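
The abstract does not spell out the concept-aware contrastive objective, so the following is only a minimal sketch of one common form such an objective can take: a supervised contrastive (SupCon-style) loss over latent vectors, where latents that share a concept label are pulled together and latents with different labels are pushed apart. The function names, the cosine similarity measure, and the temperature value are illustrative assumptions, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def concept_contrastive_loss(latents, labels, temperature=0.1):
    """SupCon-style loss, used here as a stand-in (assumption).

    latents: list of latent vectors (hypothetical SAE activations).
    labels:  concept label for each latent vector.
    Anchors with no same-concept partner are skipped.
    """
    n = len(latents)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Scaled similarities between the anchor and every other sample.
        logits = {
            j: cosine(latents[i], latents[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of the same-concept samples.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


labels = [0, 0, 1, 1]
# Latents grouped by concept: each concept occupies its own direction.
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Latents shared across concepts: same-concept samples point in different directions.
entangled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]

loss_separated = concept_contrastive_loss(separated, labels)
loss_entangled = concept_contrastive_loss(entangled, labels)
```

On this toy data the loss is lower for the concept-separated latents than for the entangled ones, which is the property a separation objective of this kind is meant to reward during SAE training.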
