Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
This work addresses the compute cost of large-scale text-to-image generation, offering an incremental improvement that plugs into existing pipelines rather than replacing them.
The paper tackles the computational expense of text-to-image diffusion models by reducing redundancy across correlated prompts. Its training-free method clusters prompts by semantic similarity and shares computation in the early diffusion steps. For models conditioned on image embeddings, this significantly reduces compute cost while improving image quality.
Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
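The cluster-then-share idea in the abstract can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the greedy cosine-threshold clustering, the single level of sharing, the `toy_denoise_step` stand-in for a real denoiser, and all function and parameter names are hypothetical. The sketch only shows how running the early steps once per cluster lowers the number of denoiser calls.

```python
import numpy as np


def cluster_by_cosine(embeddings, threshold):
    """Greedy clustering (an assumed stand-in): put each embedding in the first
    cluster whose centroid has cosine similarity >= threshold, else open a new one."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters = []  # each entry is a list of prompt indices
    for i, e in enumerate(unit):
        for members in clusters:
            centroid = unit[members].mean(axis=0)
            centroid /= np.linalg.norm(centroid)
            if float(e @ centroid) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def toy_denoise_step(x, cond, t, total_steps):
    """Hypothetical denoiser call: moves x toward the conditioning vector.
    At the last step (t = total_steps - 1) it lands exactly on cond."""
    return x + (cond - x) / (total_steps - t)


def generate_shared(embeddings, clusters, total_steps, shared_steps, rng):
    """Run the first `shared_steps` once per cluster (conditioned on the cluster
    mean), then finish the remaining steps separately for every prompt."""
    calls = 0
    outputs = np.empty_like(embeddings)
    for members in clusters:
        centroid = embeddings[members].mean(axis=0)
        x = rng.standard_normal(embeddings.shape[1])
        for t in range(shared_steps):  # shared early steps
            x = toy_denoise_step(x, centroid, t, total_steps)
            calls += 1
        for i in members:  # per-prompt late steps
            xi = x.copy()
            for t in range(shared_steps, total_steps):
                xi = toy_denoise_step(xi, embeddings[i], t, total_steps)
                calls += 1
            outputs[i] = xi
    return outputs, calls


# Four toy "prompt embeddings" forming two similar pairs.
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])
total_steps, shared_steps = 10, 4

clusters = cluster_by_cosine(embeddings, threshold=0.9)
outputs, calls = generate_shared(
    embeddings, clusters, total_steps, shared_steps, np.random.default_rng(0)
)
baseline_calls = len(embeddings) * total_steps  # no sharing at all
print(clusters, calls, baseline_calls)
```

In this toy setup the saving grows with cluster size and with the number of shared steps. A real pipeline would also have to choose how many steps can be shared before image quality suffers, which the abstract ties to the coarse-to-fine behaviour of diffusion models.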