Unified Text-Image Generation with Weakness-Targeted Post-Training
This work, aimed at AI researchers, addresses the limited cross-modal coupling of current text-to-image synthesis systems; the contribution is incremental in that it builds on existing unified architectures.
The paper tackles the separate, sequential inference used in unified multimodal generation by proposing a post-training approach for fully unified text-image generation, and reports improvements across four diverse text-to-image benchmarks.
Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching: they generate reasoning text and are then manually switched to image generation. This separate, sequential inference process limits cross-modal coupling and precludes automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, in which a model autonomously transitions from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally compare post-training data strategies, showing that a targeted dataset addressing specific model limitations achieves better results than broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training on fully self-generated synthetic data, our approach improves multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and of strategically designed post-training data.
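To make the idea of offline, reward-weighted post-training over both modalities concrete, the following is a minimal sketch and not the paper's actual objective. It assumes that each self-generated sample comes with per-token log-probabilities for its text part and its image part plus a scalar reward, and that rewards are turned into sample weights with a softmax. The function names, the softmax weighting, the `temperature` parameter, and the `text_coef` / `image_coef` coefficients are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def reward_weights(rewards: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn scalar rewards into normalized sample weights (assumed: softmax)."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the tokens of one modality."""
    return -sum(token_logprobs) / len(token_logprobs)


def reward_weighted_loss(
    text_logprobs: Sequence[Sequence[float]],
    image_logprobs: Sequence[Sequence[float]],
    rewards: Sequence[float],
    text_coef: float = 1.0,   # hypothetical weight on the text modality
    image_coef: float = 1.0,  # hypothetical weight on the image modality
    temperature: float = 1.0,
) -> float:
    """Reward-weighted NLL summed over samples, covering both modalities."""
    weights = reward_weights(rewards, temperature)
    loss = 0.0
    for w, text_lp, image_lp in zip(weights, text_logprobs, image_logprobs):
        loss += w * (text_coef * mean_nll(text_lp) + image_coef * mean_nll(image_lp))
    return loss
```

In this sketch, setting `text_coef` or `image_coef` to zero corresponds to weighting only one modality, which is one simple way to probe the relative importance of each modality during post-training.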