What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
This work addresses the insufficient realism of synthetic training data for Scene Text Recognition, a problem relevant to computer vision researchers and practitioners, and offers an incremental improvement through more diverse simulation and more efficient annotation of real data.
The authors tackle the domain gap between synthetic and real data in Scene Text Recognition by developing UnionST, a synthetic data engine with improved diversity and realism, and a self-evolution learning (SEL) framework for efficient real data annotation. Results show that models trained on the resulting UnionST-S dataset outperform models trained on existing synthetic datasets, sometimes surpassing real-data performance, and that with SEL they achieve competitive results using only 9% of real data labels.
Large-scale, category-balanced text data is essential for training effective Scene Text Recognition (STR) models, but it is hard to obtain when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, models trained on it often lag behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over models trained on existing synthetic datasets, and even surpass real-data performance in certain scenarios. Moreover, with SEL, the trained models achieve competitive performance while seeing only 9% of real data labels.