DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
DCText addresses a specific bottleneck in visual text generation: accurately rendering long or multi-part text prompts. The contribution is incremental, since it builds on existing Multi-Modal Diffusion Transformers rather than introducing a new model.
The paper tackles the problem of generating images that contain long or multiple texts, which existing text-to-image models struggle with because global attention becomes diluted. In experiments on single- and multi-sentence benchmarks, DCText achieves the best text accuracy without compromising image quality and delivers the lowest generation latency.
Although recent text-to-image models achieve high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
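The idea of applying two region-aware masks in sequence can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the exact mask rules or schedule, so everything below is an assumption made for illustration. The sketch assumes a joint token sequence of prompt tokens followed by image tokens, a hypothetical Text-Focus rule in which each token attends only within its own segment/region group, a hypothetical Context-Expansion rule that additionally lets all image tokens attend to each other, and a hypothetical `switch_frac` parameter that sets when the second mask takes over. The global prompt and Localized Noise Initialization are left out.

```python
import numpy as np

def build_masks(seg_lens, region_of_img_token):
    """Build two boolean attention masks (True = attention allowed).

    seg_lens: number of prompt tokens in each text segment (assumed layout).
    region_of_img_token: 1-D int array with the region index of each image
        token, or -1 for background tokens outside every text region.
    """
    region_of_img_token = np.asarray(region_of_img_token)
    n_txt = int(sum(seg_lens))
    n = n_txt + len(region_of_img_token)

    # Segment index of every prompt token, e.g. [2, 1] -> [0, 0, 1].
    seg_of_tok = np.repeat(np.arange(len(seg_lens)), seg_lens)
    # Group id of every token in the joint sequence (prompt tokens first).
    group = np.concatenate([seg_of_tok, region_of_img_token])
    is_img = np.arange(n) >= n_txt
    is_bg = group == -1

    # Assumed Text-Focus rule: a token attends only to tokens in its own
    # group (segment i <-> region i); background attends to background.
    same_group = (group[:, None] == group[None, :]) & (group[:, None] >= 0)
    text_focus = same_group | (is_bg[:, None] & is_bg[None, :])

    # Assumed Context-Expansion rule: keep the text/region pairing, but let
    # every image token see every other image token for global coherence.
    context_expansion = text_focus | (is_img[:, None] & is_img[None, :])
    return text_focus, context_expansion

def mask_for_step(step, n_steps, masks, switch_frac=0.5):
    """Pick the mask for a denoising step; switch_frac is hypothetical."""
    text_focus, context_expansion = masks
    return text_focus if step < switch_frac * n_steps else context_expansion

# Two segments (2 and 1 prompt tokens) and four image tokens:
# two in region 0, one in region 1, one background.
masks = build_masks([2, 1], np.array([0, 0, 1, -1]))
```

In a real pipeline, the selected mask would be passed to the attention layers at each denoising step. The early steps keep each region tied to its own text segment, and the later steps relax the constraint between image tokens so the regions blend into one image.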