Self-Training for Domain Adaptive Scene Text Detection
This paper addresses the problem of expensive data annotation for domain adaptation in scene text detection, offering an incremental improvement over existing methods.
The paper tackles domain adaptation for scene text detection by proposing a self-training framework that mines hard examples with pseudo-labels from unannotated data, using a text mining module to reduce noise and an image-to-video generation method when videos are unavailable. Experimental results on benchmarks like ICDAR2015 show the method achieves comparable or superior results to state-of-the-art methods.
Though deep learning based scene text detection has achieved great progress, well-trained detectors suffer severe performance degradation when applied to different domains. In general, a tremendous amount of data is needed to train a detector in the target domain. However, data collection and annotation are expensive and time-consuming. To address this problem, we propose a self-training framework that automatically mines hard examples with pseudo-labels from unannotated videos or images. To reduce the noise in the hard examples, a novel text mining module is implemented based on the fusion of detection and tracking results. An image-to-video generation method is then designed for tasks where videos are unavailable and only images can be used. Experimental results on standard benchmarks, including ICDAR2015, MSRA-TD500, and ICDAR2017 MLT, demonstrate the effectiveness of our self-training method. A simple Mask R-CNN adapted with self-training and fine-tuned on real data can achieve results comparable or even superior to the state-of-the-art methods.
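The idea of fusing detection and tracking results to mine hard examples can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `mine_hard_examples`, the IoU threshold, the minimum track length, and the input layout (per-frame detection boxes plus tracker trajectories keyed by a track id) are all assumptions made for illustration. The sketch treats a tracked box with no overlapping detection as a hard positive (a likely missed text instance) and a detection that no sufficiently long trajectory supports as a hard negative (a likely false alarm).

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mine_hard_examples(
    detections: List[List[Box]],        # detections[f] = boxes found in frame f
    tracks: Dict[int, Dict[int, Box]],  # tracks[track_id][f] = tracked box in frame f
    iou_thr: float = 0.5,               # assumed matching threshold
    min_track_len: int = 3,             # assumed minimum length of a trusted track
):
    """Return (hard_positives, hard_negatives) as lists of (frame, box)."""
    # Only trajectories that persist long enough are trusted as evidence of text.
    trusted = {tid: t for tid, t in tracks.items() if len(t) >= min_track_len}

    hard_pos, hard_neg = [], []

    # Hard positives: a trusted track has a box in this frame, but no
    # detection overlaps it, so the detector probably missed the text.
    for t in trusted.values():
        for f in sorted(t):
            dets = detections[f] if f < len(detections) else []
            if not any(iou(t[f], d) >= iou_thr for d in dets):
                hard_pos.append((f, t[f]))

    # Hard negatives: a detection that no trusted track supports in its
    # frame, so it is probably a false alarm.
    for f, dets in enumerate(detections):
        tracked = [t[f] for t in trusted.values() if f in t]
        for d in dets:
            if not any(iou(d, b) >= iou_thr for b in tracked):
                hard_neg.append((f, d))

    return hard_pos, hard_neg
```

In a self-training loop, the mined hard positives would serve as pseudo-labels and the hard negatives as background samples when re-training the detector. How the tracker produces `tracks`, and how static images are turned into pseudo-videos when no video is available, are outside the scope of this sketch.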