CVDB May 19

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

arXiv:2605.19320
86.9
AI Analysis

For practitioners deploying text-to-image models, this offers a non-invasive, scalable alternative to architecture-specific modifications for improving text rendering.

TextAlign addresses poor text rendering in text-to-image models by framing it as a post-training preference-alignment problem, using a hierarchical VLM-based reward to improve OCR accuracy without modifying the generator architecture. On FLUX.1-dev and Z-Image-Turbo, it achieves consistent gains in text accuracy while maintaining general generation quality.

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Comparisons with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
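As a rough illustration of how binary defect judgments at the global, word, and glyph levels could be turned into a scalar signal usable by both GRPO and DPO, consider the minimal Python sketch below. It is not the paper's implementation: the level weights, the defect-rate averaging, and all function names are assumptions made for illustration, and the VLM that would produce the defect flags is left out entirely. The group-relative normalization follows the standard GRPO form (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev

# Hypothetical level weights: the abstract does not say how the three
# levels are combined, so these values are purely illustrative.
LEVEL_WEIGHTS = {"global": 0.2, "word": 0.4, "glyph": 0.4}


def scalar_reward(defects):
    """Map binary defect flags (True = defect found) to a reward in [0, 1].

    `defects` maps a level name to a list of booleans, one per check that
    a VLM judge is assumed to have already answered for one image.
    """
    reward = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        flags = defects.get(level, [])
        # A level with no checks is treated as defect-free (an assumption).
        defect_rate = sum(flags) / len(flags) if flags else 0.0
        reward += weight * (1.0 - defect_rate)
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def preference_pair(rewards):
    """Pick (chosen_index, rejected_index) for a DPO-style pair.

    Returns None when all samples tie, since no preference can be formed.
    """
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None
    return best, worst


if __name__ == "__main__":
    # Three hypothetical samples generated for the same prompt.
    group = [
        {"global": [False], "word": [True, False], "glyph": [True, False, False, False]},
        {"global": [False], "word": [False, False], "glyph": [False, False, False, False]},
        {"global": [True], "word": [True, True], "glyph": [True, True, False, False]},
    ]
    rewards = [scalar_reward(d) for d in group]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(preference_pair(rewards))
```

The same scalar serves both training modes in this sketch: GRPO consumes the standardized advantages for every sample in the group, while DPO only needs the best and worst samples as a chosen/rejected pair.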
