Traditional Chinese Synthetic Datasets Verified with Labeled Data for Scene Text Recognition
This work addresses data scarcity for Traditional Chinese text recognition. The contribution is incremental, since it applies existing synthetic data methods to a new domain.
The paper tackles the lack of labeled data for Traditional Chinese scene text recognition by generating over 20 million synthetic samples and collecting over 7,000 manually labeled samples (TC-STR 7k-word) as a benchmark. Text recognition models reach much better accuracy when trained from scratch on the synthetic data or further fine-tuned on TC-STR 7k-word.
Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but data labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text recognition. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking. This paper presents a framework for a Traditional Chinese synthetic data engine that aims to improve text recognition model performance. We generated over 20 million synthetic samples and collected over 7,000 manually labeled samples, TC-STR 7k-word, as the benchmark. Experimental results show that a text recognition model can achieve much better accuracy either by training from scratch with our generated synthetic data or by further fine-tuning with TC-STR 7k-word.
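
As a rough illustration of how accuracy on a labeled benchmark such as TC-STR 7k-word might be measured, the sketch below computes exact-match word accuracy. The abstract does not specify the metric, so exact string matching is an assumption here. The function name `word_accuracy` and the toy strings are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Sequence


def word_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the fraction of samples whose prediction equals its label exactly.

    Assumption: a prediction counts as correct only on an exact string match.
    The paper may use a different accuracy definition.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data, not drawn from TC-STR 7k-word.
labels = ["臺北車站", "便利商店", "停車場", "圖書館"]
# The third prediction confuses two visually similar characters (埸 vs 場).
predictions = ["臺北車站", "便利商店", "停車埸", "圖書館"]

accuracy = word_accuracy(predictions, labels)
print(f"word accuracy: {accuracy:.2f}")
```

Under this definition, a single wrong character makes the whole word incorrect, which is a strict criterion for a script with many visually similar characters.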