Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation
The work provides an interpretable metric for evaluating the consistency of diffusion model image generation, aiding model selection and architecture assessment, though the contribution is incremental because it builds on existing CLIP-based methods.
The study addresses the problem of quantifying the repeatability of image generation in diffusion models by proposing a semantic consistency score based on the pairwise mean CLIP score. It finds statistically significant differences between the scores of Stable Diffusion XL and PixArt-α, 94% agreement between the score-selected model and aggregated human annotations, and significantly higher consistency in a LoRA-fine-tuned version of Stable Diffusion XL.
In this study, we identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models. We propose a semantic approach, using a pairwise mean CLIP (Contrastive Language-Image Pretraining) score as our Semantic Consistency Score. We applied this metric to compare two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL (SDXL) and PixArt-α, and found statistically significant differences between the models' semantic consistency scores. Agreement between the model selected by the Semantic Consistency Score and aggregated human annotations was 94%. We also compared the consistency of SDXL with that of a LoRA-fine-tuned version of SDXL and found that the fine-tuned model had significantly higher semantic consistency in its generated images. The Semantic Consistency Score proposed here offers a measure of image generation alignment, facilitating the evaluation of model architectures for specific tasks and supporting informed model selection.
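As an illustration of what a pairwise mean score of this kind might look like, the following is a minimal sketch and not the authors' implementation. It assumes that each image generated from the same prompt has already been encoded into a CLIP image embedding (the encoding step is omitted), and that the score is the mean cosine similarity over all unordered pairs of those embeddings. The function names `cosine_similarity` and `semantic_consistency_score` are hypothetical, and any scaling or clipping applied in the actual work is not reproduced here.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def semantic_consistency_score(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over all unordered pairs of embeddings.

    Assumption: `embeddings` holds one CLIP image embedding per image,
    all generated from the same prompt by the same model.
    """
    if len(embeddings) < 2:
        raise ValueError("at least two embeddings are required")
    sims = [cosine_similarity(u, v) for u, v in combinations(embeddings, 2)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real CLIP embeddings.
    toy_embeddings = [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.7, 0.3, 0.0],
    ]
    print(f"score = {semantic_consistency_score(toy_embeddings):.3f}")
```

Under these assumptions, a score near 1 means the images generated for a prompt are semantically close to one another, while lower values indicate more variation between repeated generations. Comparing two models would then amount to computing this score per prompt for each model and testing the difference between the two sets of scores.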