Scaling Concept With Text-Guided Diffusion Models
This work addresses the need for fine-grained control over concepts in generative models, enabling tasks such as canonical pose generation and sound highlighting. Its contribution is incremental, since it builds on existing text-guided diffusion frameworks.
The paper tackles the problem of enhancing or suppressing existing concepts in text-guided diffusion models rather than replacing them. It introduces ScalingConcept, a method that scales decomposed concepts in real inputs up or down without adding new elements, and that enables novel zero-shot applications across the image and audio domains.
Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog to a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real input without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, where concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.
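To make the idea of scaling a concept up or down concrete, the following is a minimal sketch and not the paper's stated formulation. It assumes the concept can be represented as the difference between a text-conditioned and an unconditioned noise prediction, as in classifier-free guidance, and that a single scalar controls its strength. The names `scale_concept`, `eps_cond`, `eps_uncond`, and `omega` are hypothetical, and the toy vectors stand in for the outputs of a real denoising network.

```python
from typing import List


def scale_concept(eps_cond: List[float], eps_uncond: List[float], omega: float) -> List[float]:
    """Return eps_uncond + omega * (eps_cond - eps_uncond), elementwise.

    Assumption (not taken from the source): the concept is the difference
    between the conditioned and unconditioned noise predictions.
    omega > 1 enhances it, 0 <= omega < 1 suppresses it, omega == 1 leaves it as is.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + omega * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy stand-ins for the two noise predictions at a single denoising step.
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -1.0, 1.0]

enhanced = scale_concept(eps_cond, eps_uncond, omega=2.0)    # concept amplified
unchanged = scale_concept(eps_cond, eps_uncond, omega=1.0)   # recovers eps_cond
suppressed = scale_concept(eps_cond, eps_uncond, omega=0.0)  # recovers eps_uncond
```

Under this reading, no new text prompt is introduced: the same concept direction is only weighted more or less heavily, which matches the stated goal of not adding new elements to the input.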