Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation
This work addresses a specific problem in image generation, namely combining multiple personalized concepts in a single image, and represents an incremental improvement over prior LoRA-based techniques.
The paper tackles the challenge of generating images that combine multiple concepts using pre-trained LoRA models, where existing methods often omit some concepts or combine them incorrectly. The proposed CLoRA method updates the attention maps of the LoRA models at test time and uses them to build semantic masks for fusing latent representations. The authors report that it significantly outperforms existing methods in multi-concept image generation.
Low-Rank Adaptation (LoRA) has emerged as a powerful and popular technique for personalization, enabling efficient adaptation of pre-trained image generation models to specific tasks without comprehensive retraining. While individual pre-trained LoRA models excel at representing single concepts, such as a specific dog or cat, using multiple LoRA models to capture several concepts in a single image remains a significant challenge. Existing methods often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to cases where one concept is completely ignored (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). We introduce CLoRA, a training-free approach that addresses these limitations by updating the attention maps of multiple LoRA models at test time and using those attention maps to create semantic masks for fusing latent representations. This enables the generation of composite images that accurately reflect the characteristics of each LoRA. Our comprehensive qualitative and quantitative evaluations demonstrate that CLoRA significantly outperforms existing methods in multi-concept image generation using LoRAs.
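The mask-based fusion step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how masks are derived from attention maps, so the per-pixel argmax rule, the min-max normalization, and the function names `masks_from_attention` and `fuse_latents` are all assumptions made for illustration. The test-time attention update itself is not shown.

```python
import numpy as np

def masks_from_attention(attn_maps):
    """Turn per-concept attention maps of shape (K, H, W) into binary masks.

    Assumption: each pixel is assigned to the concept whose normalized
    attention is highest. The paper may use a different rule.
    """
    attn = np.asarray(attn_maps, dtype=np.float64)
    # Min-max normalize each map so concepts are compared on the same scale.
    lo = attn.min(axis=(1, 2), keepdims=True)
    hi = attn.max(axis=(1, 2), keepdims=True)
    norm = (attn - lo) / np.maximum(hi - lo, 1e-8)
    winner = norm.argmax(axis=0)  # (H, W) index of the dominant concept
    num_concepts = attn.shape[0]
    return np.stack([winner == k for k in range(num_concepts)]).astype(np.float64)

def fuse_latents(latents, masks):
    """Combine per-LoRA latents (K, C, H, W) with masks (K, H, W) into (C, H, W)."""
    latents = np.asarray(latents, dtype=np.float64)
    # Broadcast each mask over the channel axis, then sum over concepts.
    return (latents * masks[:, None, :, :]).sum(axis=0)

# Toy example: two concepts (say, "cat" and "dog") on a 2x2 latent grid.
attn = np.array([
    [[0.9, 0.1], [0.8, 0.2]],   # concept 0 attends to the left column
    [[0.2, 0.7], [0.1, 0.9]],   # concept 1 attends to the right column
])
latents = np.stack([
    np.full((1, 2, 2), 1.0),    # stand-in latent from the first LoRA
    np.full((1, 2, 2), 2.0),    # stand-in latent from the second LoRA
])
masks = masks_from_attention(attn)
fused = fuse_latents(latents, masks)
```

In this toy setup each spatial location takes its latent from exactly one LoRA, which is the property that prevents one concept from overwriting another. A real pipeline would apply this to diffusion latents at each denoising step and would likely smooth or threshold the masks rather than use a hard argmax.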