Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation
This work addresses compositional text-to-image generation with multiple subjects, offering a training-free solution that improves subject fidelity and text-image alignment, though it appears incremental because it builds on existing attention mechanisms.
The paper tackles subject-driven text-to-image generation, where existing models require tedious fine-tuning and suffer from object missing and attribute mixing in compositional prompts. It proposes a training-free guidance method that strengthens attention maps for precise attribute binding and feature injection, reports exceptional zero-shot generation ability, and introduces a novel metric, GroundingScore, for evaluation.
Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. When generating compositional subjects, they often encounter problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance that intervenes in the generative process at inference time. This approach strengthens the attention map, allowing precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric, GroundingScore, to evaluate subject alignment thoroughly. Quantitative results demonstrate the effectiveness of the proposed method. The code will be released soon.
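
The abstract does not specify how the attention map is strengthened, so the following is only a generic illustration of the idea, not the authors' method. It assumes a cross-attention map stored as rows (image locations) by columns (prompt tokens), and a hypothetical helper `strengthen_attention` that up-weights the columns of chosen subject tokens by an assumed factor `scale` and then renormalizes each row.

```python
from typing import List, Set

def strengthen_attention(
    attn: List[List[float]], subject_tokens: Set[int], scale: float = 2.0
) -> List[List[float]]:
    """Hypothetical sketch: boost attention on subject tokens, then renormalize.

    attn[i][j] is the attention weight of image location i on prompt token j.
    The scaling rule and the default factor are assumptions for illustration.
    """
    boosted = []
    for row in attn:
        # Multiply the weights of the selected subject tokens by `scale`.
        new_row = [w * scale if j in subject_tokens else w for j, w in enumerate(row)]
        total = sum(new_row)
        # Renormalize so each row is again a distribution over tokens.
        boosted.append([w / total for w in new_row] if total > 0 else list(row))
    return boosted

# Toy map: 2 image locations, 3 prompt tokens; token 1 is the subject token.
attn = [
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
boosted = strengthen_attention(attn, subject_tokens={1}, scale=2.0)
```

In this toy example every location attends more strongly to token 1 after boosting, while each row still sums to one. A real pipeline would apply such an intervention inside the diffusion model's cross-attention layers during sampling, which this sketch does not attempt.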