Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
The work addresses the risk that text-to-image models generate biased or harmful images, a concern for both users and developers. The contribution is incremental, since it builds on existing methods for interpretable latent spaces.
The paper tackles inappropriate content generation in diffusion-based text-to-image models. It develops a self-supervised method to discover interpretable latent directions for arbitrary concepts, such as those tied to bias or harm, and proposes a mitigation approach that the authors report to be effective for fair generation, safe generation, and responsible text-enhancing generation.
Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content, from the perspective of the diffusion model's internal representation, remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. Existing approaches, however, cannot discover directions for arbitrary concepts, such as those related to inappropriate content. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach for fair generation, safe generation, and responsible text-enhancing generation. Project page: https://interpretdiffusion.github.io
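
The abstract does not spell out the optimization, so the following is only a toy sketch of the general idea of learning a latent direction against a frozen model and then adding it at generation time. It is not the authors' implementation. The linear `decode` function stands in for a frozen diffusion network, and every name (`W`, `true_dir`, `make_batch`, `learn_direction`, `steer`) and every hyperparameter is a hypothetical chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

D_LATENT, D_OUT = 8, 16

# Assumption: a fixed linear map stands in for the frozen generator.
# It is never updated; only the concept vector is trained.
W = rng.normal(size=(D_OUT, D_LATENT)) / np.sqrt(D_OUT)

def decode(h):
    """Frozen toy 'generator': works on one latent or a batch of latents."""
    return h @ W.T

# Assumption: a hidden direction used only to synthesize training targets,
# playing the role of "images that contain the concept".
true_dir = rng.normal(size=D_LATENT)

def make_batch(n):
    """Latents from concept-free inputs, paired with concept-bearing outputs."""
    h_plain = rng.normal(size=(n, D_LATENT))
    targets = decode(h_plain + true_dir)
    return h_plain, targets

def learn_direction(steps=2000, lr=0.1, batch=32):
    """Fit a vector c so that decode(h + c) reproduces the concept-bearing targets."""
    c = np.zeros(D_LATENT)
    for _ in range(steps):
        h, y = make_batch(batch)
        err = decode(h + c) - y                # residual per sample
        grad = 2.0 * (err @ W).mean(axis=0)    # gradient of mean squared error w.r.t. c
        c -= lr * grad
    return c

def steer(h, c, scale=1.0):
    """Apply a learned direction at generation time by shifting the latent."""
    return decode(h + scale * c)

if __name__ == "__main__":
    c = learn_direction()
    print("recovery error:", np.linalg.norm(c - true_dir))
```

In this toy setting the learned vector recovers the hidden direction because the frozen map is linear and well conditioned. A real diffusion model is nonlinear and operates over many denoising steps, so the sketch only illustrates the training signal (reconstruct concept-bearing outputs from concept-free inputs) and the inference-time shift.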