CVDec 8, 2023

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

Tongkun Guan, Wei Shen, Xue Yang, Xuehui Wang, Xiaokang Yang

arXiv:2312.05286v39.19 citationsh-index: 8Has CodeECCV

Originality Incremental advance

AI Analysis

This work addresses the problem of limited annotated real data for scene text detection, offering a domain adaptation solution that is incremental but provides measurable improvements.

The paper tackles the synth-to-real domain gap in pre-training scene text detectors by proposing FreeReal, a paradigm that combines synthetic and unlabeled real data via a glyph-based mixing mechanism, achieving average performance gains of 1.97% to 4.56% across multiple methods and datasets.

Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. Differently, in this work, we propose FreeReal, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images.GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.

View on arXiv PDF Code

Similar