Consistent Subject Generation via Contrastive Instantiated Concepts
This work addresses subject variation, which limits long content generation for users of text-to-image models, though it appears incremental because it builds on existing approaches.
The paper tackles the problem of generating consistent subjects across multiple independent creations in text-to-image models. It introduces Contrastive Concept Instantiation (CoCoIns), which performs comparably to existing methods while offering higher flexibility.
While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple creations limits their application to long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.
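
The two ideas in the abstract, a deterministic mapping from latent codes to pseudo-word embeddings and a contrastive objective that separates different instances, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the network architecture or the loss, so the fixed random linear map, the InfoNCE-style loss, and the names `make_mapping_network`, `cosine`, and `info_nce` are hypothetical stand-ins, not the authors' implementation.

```python
import math
import random


def make_mapping_network(latent_dim, embed_dim, seed=0):
    """Hypothetical stand-in for the mapping network.

    Returns a function that maps a latent code to a pseudo-word embedding
    using a fixed random linear map followed by tanh (an assumption; the
    real network would be learned).
    """
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        for _ in range(embed_dim)
    ]

    def mapping(latent):
        return [
            math.tanh(sum(w * z for w, z in zip(row, latent)))
            for row in weights
        ]

    return mapping


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form, not from the paper).

    The loss is small when the anchor is closer to the positive than to
    any of the negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    mapping = make_mapping_network(latent_dim=4, embed_dim=8)
    code_a = [0.1, -0.2, 0.3, 0.4]
    code_b = [-0.5, 0.9, 0.0, -0.3]
    # Reusing a latent code yields the same pseudo-word embedding.
    print(mapping(code_a) == mapping(code_a))
    # A different latent code acts as a negative in the contrastive loss.
    print(info_nce(mapping(code_a), mapping(code_a), [mapping(code_b)]))
```

In this sketch, reusing `code_a` always yields the same embedding, which mirrors the claim that users obtain consistent subjects by reusing latent codes. A trained system would additionally condition on the prompt and feed the pseudo-word into the generative model, which is omitted here.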