Character-Aware Models Improve Visual Text Rendering
This work addresses a specific bottleneck in text-to-image generation, namely applications that require accurately rendered text, and represents an incremental improvement.
The paper traces unreliable visual text generation in image models to a lack of character-level input features, and shows that character-aware models improve accuracy by over 30 points on rare words in visual spelling tasks.
Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify this effect, we conduct a series of experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Applying our learnings to the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples.
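The contrast between character-blind and character-aware inputs can be illustrated with a minimal sketch. It assumes a hypothetical toy subword vocabulary (`TOY_VOCAB`) and two hypothetical helpers (`subword_tokenize`, `char_tokenize`). None of these come from the paper, and the byte-level scheme is only one possible way to expose characters to an encoder.

```python
from __future__ import annotations

# Toy illustration only: the vocabulary and both tokenizers are hypothetical
# stand-ins, not the tokenizers used by the paper's models.
TOY_VOCAB = {"el", "ephant", "s", "the", "draw"}  # assumed subword pieces


def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Character-blind view: greedy longest-match segmentation.

    Characters that match no vocabulary piece are emitted as '<unk>'.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a vocab hit.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")
            i += 1
    return pieces


def char_tokenize(word: str) -> list[int]:
    """Character-aware view: one ID per UTF-8 byte of the word."""
    return list(word.encode("utf-8"))


word = "elephants"
subword_pieces = subword_tokenize(word, TOY_VOCAB)
byte_ids = char_tokenize(word)

print(subword_pieces)  # opaque pieces; the spelling is hidden inside each one
print(byte_ids)        # one ID per character for this ASCII word
```

In the subword view, the encoder receives a few opaque piece IDs and has to memorize how each piece is spelled. In the byte view, the spelling is directly recoverable from the input sequence, which is the kind of signal the abstract argues is missing from character-blind encoders.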