Espresso: Robust Concept Filtering in Text-to-Image Models
This work addresses the need for safer and more reliable text-to-image generation by filtering out harmful or infringing content. The contribution is incremental, since it builds on existing concept removal techniques.
The paper tackles the removal of unacceptable concepts from text-to-image models. It introduces Espresso, a robust CLIP-based concept filter that prevents the generation of such images while preserving utility, and reports that Espresso outperforms prior methods in effectiveness and robustness.
Diffusion-based text-to-image models are trained on large datasets scraped from the Internet, potentially containing unacceptable concepts (e.g., copyright-infringing or unsafe). We need concept removal techniques (CRTs) which are i) effective in preventing the generation of images with unacceptable concepts, ii) utility-preserving on acceptable concepts, and iii) robust against evasion with adversarial prompts. No prior CRT satisfies all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). We identify unacceptable concepts by comparing the distance from the embedding of a generated image to the text embeddings of both unacceptable and acceptable concepts. This lets us fine-tune for robustness by separating the text embeddings of unacceptable and acceptable concepts while preserving utility. We present a pipeline to evaluate various CRTs to show that Espresso is more effective and robust than prior CRTs, while retaining utility.
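The filtering rule described above can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been computed by a CLIP-style encoder and are passed in as plain lists of floats. The function names (`cosine_similarity`, `is_unacceptable`), the use of cosine similarity as the distance measure, and the simple "closer to the unacceptable text wins" decision are illustrative assumptions, not the paper's exact formulation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_unacceptable(
    image_emb: Sequence[float],
    unacceptable_text_emb: Sequence[float],
    acceptable_text_emb: Sequence[float],
) -> bool:
    """Flag an image whose embedding lies closer to the unacceptable concept.

    Hypothetical decision rule: compare the image embedding against both
    text embeddings and flag the image when the unacceptable one is nearer.
    """
    sim_unacceptable = cosine_similarity(image_emb, unacceptable_text_emb)
    sim_acceptable = cosine_similarity(image_emb, acceptable_text_emb)
    return sim_unacceptable > sim_acceptable


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
unacceptable_emb = [1.0, 0.0, 0.0]
acceptable_emb = [0.0, 1.0, 0.0]
image_emb = [0.9, 0.1, 0.0]

if is_unacceptable(image_emb, unacceptable_emb, acceptable_emb):
    print("blocked: image is closer to the unacceptable concept")
else:
    print("allowed: image is closer to the acceptable concept")
```

Comparing against both concepts, rather than thresholding the distance to the unacceptable concept alone, is what makes the separation of the two text embeddings meaningful: pushing them apart widens the margin of the decision.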