CV · Nov 27, 2024

Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

arXiv:2411.18301v1 · 12 citations · h-index: 12 · Has Code · IEEE Trans Pattern Anal Mach Intell
Originality: Incremental advance
AI Analysis

This paper addresses a specific issue in text-to-image generation that matters to users who need accurate multi-subject outputs. It represents an incremental improvement.

The paper tackles the problem of subject neglect or mixing in MMDiT-based text-to-image models when generating multiple similar subjects, achieving superior generation quality and higher success rates on a new challenging dataset.

Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at https://github.com/wtybest/EnMMDiT.
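The idea of repairing a latent by test-time optimization against an overlap-style loss can be illustrated with a toy sketch. The abstract does not give the actual loss formulas, so everything below is an assumption made for illustration: each subject's attention map is modeled as a softmax over a short list of logits, the overlap measure is the inner product of the two maps, and the names `overlap_loss`, `overlap_grad`, and `repair_latent` are hypothetical. This is not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def overlap_loss(attn_a, attn_b):
    # ASSUMPTION: toy overlap measure, the inner product of two
    # normalized attention maps (not the paper's Overlap Loss).
    return sum(a * b for a, b in zip(attn_a, attn_b))

def overlap_grad(logits_self, attn_other):
    # Gradient of overlap_loss w.r.t. logits_self, obtained from the
    # softmax Jacobian: dL/dz_j = a_j * (b_j - sum_i a_i * b_i).
    attn_self = softmax(logits_self)
    mean = sum(a * g for a, g in zip(attn_self, attn_other))
    return [a * (g - mean) for a, g in zip(attn_self, attn_other)]

def repair_latent(logits_a, logits_b, steps=100, lr=2.0):
    # ASSUMPTION: plain gradient descent stands in for the
    # test-time optimization applied at early denoising steps.
    for _ in range(steps):
        attn_a, attn_b = softmax(logits_a), softmax(logits_b)
        grad_a = overlap_grad(logits_a, attn_b)
        grad_b = overlap_grad(logits_b, attn_a)
        logits_a = [x - lr * g for x, g in zip(logits_a, grad_a)]
        logits_b = [x - lr * g for x, g in zip(logits_b, grad_b)]
    return logits_a, logits_b

# Two subjects whose attention initially peaks at the same position.
start_a = [2.0, 1.0, 0.0, 0.0]
start_b = [1.5, 1.2, 0.0, 0.0]
print("overlap before:", overlap_loss(softmax(start_a), softmax(start_b)))
end_a, end_b = repair_latent(start_a, start_b)
print("overlap after: ", overlap_loss(softmax(end_a), softmax(end_b)))
```

In this toy setting the two maps start out concentrated on the same position, and the optimization pushes them toward different positions. A real MMDiT pipeline would instead backpropagate through the model's attention layers to update the image latent.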

Code Implementations: 1 repo
