Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations
This work addresses adversarial attacks and unsafe or unethical content in generative models, offering users who need safe, controllable outputs a scalable and efficient solution.
The paper tackles the problem of unsafe or unwanted concept generation in text-to-image models by proposing a test-time framework using k-sparse autoencoders to steer concepts without retraining, achieving a 20.01% improvement in unsafe concept removal and being about 5 times faster than the state-of-the-art.
Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and can inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) -- all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5}$x faster than the current state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers
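
To make the idea of steering through a k-sparse autoencoder more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes a generic top-k autoencoder over text embeddings and assumes that steering amounts to shifting an embedding along the decoder direction of one latent unit. The names `KSparseAutoencoder`, `topk_activation`, `steer`, `concept_idx`, and `strength` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def topk_activation(z, k):
    """Keep the k largest entries in each row of z and zero out the rest."""
    out = np.zeros_like(z)
    # Indices of the k largest values along the last axis (order not guaranteed).
    idx = np.argpartition(z, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out


class KSparseAutoencoder:
    """Illustrative k-sparse autoencoder with random (untrained) weights."""

    def __init__(self, d_model, d_hidden, k, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # Linear projection followed by top-k sparsification.
        return topk_activation((x - self.b_dec) @ self.W_enc + self.b_enc, self.k)

    def decode(self, h):
        # Reconstruct the embedding from the sparse code.
        return h @ self.W_dec + self.b_dec


def steer(x, sae, concept_idx, strength):
    """Shift embeddings along one latent unit's decoder direction.

    Assumption for this sketch: a negative strength moves away from the
    concept tied to that unit, and a positive strength moves towards it.
    """
    direction = sae.W_dec[concept_idx]
    return x + strength * direction


# Toy usage: three 16-dimensional "token embeddings" (random stand-ins).
sae = KSparseAutoencoder(d_model=16, d_hidden=64, k=4)
tokens = np.random.default_rng(1).normal(size=(3, 16))
codes = sae.encode(tokens)  # at most 4 active units per token
steered = steer(tokens, sae, concept_idx=7, strength=-2.0)
```

In this sketch the base generative model is never touched: only the text embedding passed to it changes, which is consistent with the test-time, no-retraining setting described above. How the concept unit is identified and how the steering strength is chosen are left open here.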