CV | Dec 2, 2024

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen

arXiv:2412.01137v2 | 8.78 citations | h-index: 20 | Has Code

Originality: Highly original

AI Analysis

TextSSR addresses data scarcity in scene text recognition, offering researchers and practitioners a scalable synthetic data solution with clear performance gains.

The paper tackles the shortage of realistic synthetic training data for scene text recognition by introducing TextSSR, a diffusion-based pipeline that synthesizes accurate, realistic text images at scale. The resulting dataset of 3.55 million instances improves model performance on common benchmarks.

Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained models. Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text at scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key synthesizing characteristics: accuracy, realism, and scalability. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement, ensuring proper character placement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background. This character-aware diffusion architecture enjoys precise character-level control and semantic coherence preservation, without relying on natural language prompts. Therefore, TextSSR supports large-scale generation through combinatorial text permutations. Based on these, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and further improvements are observed when mixed with real-world training data. Code is available at https://github.com/YesianRohn/TextSSR.
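The abstract does not spell out how the "combinatorial text permutations" are formed, so the following is only a minimal sketch of one plausible reading: new target strings are enumerated from the characters already present in the surrounding text, which needs no natural language prompt. The function name `candidate_texts` and its parameters are hypothetical and do not come from the TextSSR code base.

```python
from itertools import islice, permutations


def candidate_texts(context_words, length, limit=None):
    """Enumerate target strings built from characters seen in nearby text.

    Hypothetical helper for illustration only; not the TextSSR API.
    """
    # Character pool: unique characters in first-seen order.
    pool = list(dict.fromkeys("".join(context_words)))
    # Every ordered selection of `length` distinct characters from the pool.
    strings = ("".join(p) for p in permutations(pool, length))
    # `limit=None` returns all candidates; otherwise only the first `limit`.
    return list(islice(strings, limit))


# Two words from the same scene share the pool E, X, I, T, A.
texts = candidate_texts(["EXIT", "TAXI"], length=3, limit=5)
print(texts)
```

Even this toy pool of five characters yields 60 three-character targets, which suggests why a prompt-free, character-level generator can scale to millions of instances. A real pipeline would presumably also filter the candidates, as the quality screening behind TextSSR-F implies.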
