LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
This work addresses the challenge of rendering accurate, aesthetically pleasing text inside generated images, with applications in creative design and content creation. It represents a strong incremental advance in the field.
The paper tackles low text rendering fidelity in text-to-image generation by introducing LeX-Art, a suite comprising a high-quality dataset (LeX-10K), a prompt enrichment model (LeX-Enhancer), and two text-to-image models (LeX-FLUX and LeX-Lumina). It reports state-of-the-art text rendering performance, including a 79.81% Pairwise Normalized Edit Distance (PNED) gain for LeX-Lumina on CreateBench.
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on DeepSeek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our code, models, datasets, and demo are publicly available.
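The abstract names PNED but does not define it, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes that words from the prompt are matched one-to-one with words recovered from the image (for example by OCR) so that the summed normalized edit distance is minimal, and that each unmatched word costs 1.0. The function names `edit_distance`, `normalized_edit_distance`, and `pairwise_ned`, the unit penalty, and the brute-force matching are all assumptions made for illustration.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def pairwise_ned(target_words, ocr_words):
    """Hypothetical pairwise score: lower is better, 0 means a perfect match.

    Assumption: the shorter word list is matched one-to-one against the
    longer one to minimize total normalized edit distance, and every word
    left unmatched adds a penalty of 1.0.
    """
    short, long_ = sorted((list(target_words), list(ocr_words)), key=len)
    unmatched = len(long_) - len(short)
    # Brute-force search over assignments; fine for a handful of words only.
    best = min(
        sum(normalized_edit_distance(s, l) for s, l in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best + unmatched
```

Because the matching is order-independent, `pairwise_ned(["hello", "world"], ["world", "hello"])` scores a perfect match even though the words appear in a different order. The exhaustive search grows factorially with the number of words, so a real evaluation over longer texts would need a polynomial-time assignment solver instead.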