Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss
This work addresses the challenge of enhancing creativity in text-to-image models, though it appears incremental as it builds on existing style ambiguity concepts.
The paper tackles the problem of training text-to-image models for creativity by proposing a new style ambiguity loss that eliminates the need for a pretrained classifier or a labeled dataset. It reports that this method improves upon the traditional approach according to automated metrics that stand in for human judgment.
Teaching text-to-image models to be creative involves using a style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity, imbuing it with creativity, and find that our new methods improve upon the traditional method according to automated metrics for human judgment, while still maintaining creativity and novelty.
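
To make the idea of a classifier-free style ambiguity objective more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each generated image has already been mapped to an embedding vector (for example by a multimodal foundation model), that a set of style centroids has been obtained by clustering such embeddings, and that style probabilities are a softmax over negative distances to those centroids. The function names, the `temperature` parameter, and the toy vectors are all hypothetical. The loss is the cross-entropy between the uniform distribution and the predicted style distribution, so it is smallest when the image is equally close to every style.

```python
import numpy as np

def style_probabilities(embedding, centroids, temperature=1.0):
    """Soft assignment of one embedding to K style centroids.

    Assumption: probabilities are a softmax over negative Euclidean
    distances; the paper may define them differently.
    """
    dists = np.linalg.norm(centroids - embedding, axis=1)
    logits = -dists / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def style_ambiguity_loss(probs, eps=1e-12):
    """Cross-entropy between the uniform distribution and `probs`.

    The minimum value is log(K), reached when `probs` is uniform,
    i.e. when the style is maximally ambiguous.
    """
    k = probs.shape[0]
    uniform = np.full(k, 1.0 / k)
    return float(-np.sum(uniform * np.log(probs + eps)))

# Toy stand-ins for real embeddings and cluster centroids (hypothetical).
centroids = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
ambiguous = np.array([0.0, 0.0])   # equidistant from every centroid
on_style = np.array([1.0, 0.0])    # sits exactly on one centroid

loss_ambiguous = style_ambiguity_loss(style_probabilities(ambiguous, centroids))
loss_on_style = style_ambiguity_loss(style_probabilities(on_style, centroids))
```

In this toy setup the equidistant embedding attains the lower bound of the loss, while the embedding that coincides with a single centroid is penalized. In an actual training loop the negative of such a loss (or an equivalent reward) would be what the diffusion model is optimized against, which would require a differentiable framework in place of NumPy.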