CV · Dec 19, 2023

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

arXiv:2312.12232v1 · 56 citations · h-index: 13 · AAAI
Originality: Incremental advance
AI Analysis

This work addresses the generation of realistic scene text in images, with applications such as multilingual content creation. The advance is incremental, since the method builds on a pre-trained Stable Diffusion model.

The authors tackle the challenge of generating accurate multilingual scene text in images with diffusion models. They propose Diff-Text, a training-free framework that they report improves both text recognition accuracy and foreground-background blending over existing methods.

Recently, diffusion-based image generation methods have been credited with remarkable text-to-image generation capabilities, yet they still face challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Given a text in any language along with a textual description of a scene, our model outputs a photo-realistic image. The model leverages rendered sketch images as priors, thereby eliciting the latent multilingual generation ability of the pre-trained Stable Diffusion. Based on the observed influence of the cross-attention map on object placement in generated images, we introduce a localized attention constraint into the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
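The abstract does not spell out how the localized attention constraint is formulated. As a rough illustration only, one common way to localize cross-attention is to suppress the logits of selected prompt tokens at spatial positions outside a target region before the softmax. The sketch below assumes that approach; the function name `localized_attention`, its parameters, and the `penalty` value are hypothetical and are not taken from the paper or its code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def localized_attention(scores, text_mask, text_tokens, penalty=-1e9):
    """Illustrative (assumed) localized cross-attention constraint.

    scores:      one list of pre-softmax logits per spatial position,
                 with one logit per prompt token.
    text_mask:   one bool per spatial position, True inside the region
                 where the scene text should appear.
    text_tokens: indices of prompt tokens that refer to the scene text.

    Outside the region, the logits of the text tokens are replaced by a
    large negative penalty, so those positions give them ~0 weight.
    """
    attention = []
    for pos_scores, inside in zip(scores, text_mask):
        row = [
            penalty if (not inside and t in text_tokens) else s
            for t, s in enumerate(pos_scores)
        ]
        attention.append(softmax(row))
    return attention


# Two spatial positions, three prompt tokens; token 1 is the text token.
scores = [[1.0, 2.0, 0.5], [0.3, 1.5, 0.2]]
text_mask = [True, False]  # only the first position lies in the text region
attn = localized_attention(scores, text_mask, text_tokens={1})
```

In this toy case the first position still attends mostly to the text token, while the second position, which lies outside the region, assigns it essentially no weight. Each row remains a valid probability distribution over the prompt tokens.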

Code Implementations: 1 repo