CV · Nov 26, 2021

Traditional Chinese Synthetic Datasets Verified with Labeled Data for Scene Text Recognition

Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang, Yi-Ren Yeh

arXiv:2111.13327v21.44 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses data scarcity for Traditional Chinese text recognition. The contribution is incremental, since it applies existing synthetic data methods to a new domain.

The paper tackles the lack of labeled data for Traditional Chinese scene text recognition in two ways: it generates over 20 million synthetic samples, and it collects over 7,000 manually labeled samples (TC-STR 7k-word) as a benchmark. Text recognition models reach much better accuracy when trained from scratch on the synthetic data or further fine-tuned on TC-STR 7k-word.

Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but data labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text recognition. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking. This paper presents a framework for a Traditional Chinese synthetic data engine that aims to improve text recognition model performance. We generated over 20 million synthetic samples and collected over 7,000 manually labeled samples, TC-STR 7k-word, as the benchmark. Experimental results show that a text recognition model can achieve much better accuracy either by training from scratch with our generated synthetic data or by further fine-tuning with TC-STR 7k-word.

