CV, Sep 25, 2025

A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models

arXiv:2509.21008v1, 8 citations, h-index: 5
Originality: Highly original
AI Analysis

This work addresses safety risks in text-to-image generation and offers a precise method for concept erasure, though it is incremental in that it builds on existing techniques such as sparse autoencoders.

The paper tackles the problem of precisely erasing harmful concepts in text-to-image diffusion models while minimizing image quality degradation, achieving state-of-the-art results by manipulating only a single neuron.

Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
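
The pipeline the abstract describes (encode text embeddings with an SAE, score latent neurons, suppress one, decode) can be illustrated with a minimal sketch. The abstract does not give the exact form of the "modulated frequency scoring", so the sketch substitutes a plain difference in activation frequency between target and neutral prompts. The function names, the identity SAE weights, and the toy embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # ReLU encoder: maps an embedding to a non-negative sparse latent code.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    # Linear decoder: maps the latent code back to embedding space.
    return z @ W_dec + b_dec

def frequency_score(z_target, z_neutral, eps=1e-6):
    # Stand-in score (assumption): how much more often each neuron fires
    # on target-concept prompts than on neutral prompts.
    f_target = (z_target > eps).mean(axis=0)
    f_neutral = (z_neutral > eps).mean(axis=0)
    return f_target - f_neutral

def erase(x, neuron, W_enc, b_enc, W_dec, b_dec):
    # Zero one latent neuron and decode, adding back the SAE's
    # reconstruction error so unrelated content is left untouched.
    z = sae_encode(x, W_enc, b_enc)
    residual = x - sae_decode(z, W_dec, b_dec)
    z_edit = z.copy()
    z_edit[..., neuron] = 0.0
    return sae_decode(z_edit, W_dec, b_dec) + residual

# Toy setup: a 4-dimensional "embedding" and an identity SAE (hypothetical).
d = 4
W_enc, b_enc = np.eye(d), np.zeros(d)
W_dec, b_dec = np.eye(d), np.zeros(d)

# In this toy data, dimension 2 is active only for target-concept prompts.
target = np.array([[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 3.0, 0.0]])
neutral = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]])

scores = frequency_score(sae_encode(target, W_enc, b_enc),
                         sae_encode(neutral, W_enc, b_enc))
neuron = int(np.argmax(scores))

erased_target = erase(target, neuron, W_enc, b_enc, W_dec, b_dec)
erased_neutral = erase(neutral, neuron, W_enc, b_enc, W_dec, b_dec)
```

In this toy case the selected neuron is the one that fires only on target prompts, the edited target embeddings lose that component, and the neutral embeddings pass through unchanged. A trained SAE would have learned, overcomplete weights in place of the identity matrices used here.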

