Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets
This work addresses the challenge of enhancing text recognition accuracy for applications like OCR. It offers a competitive alternative to transfer learning, but its methodological contributions are incremental.
The paper improves text recognition transformers through masked self-supervised pre-training with two modifications: a progressively increasing masking probability and a loss that covers both masked and non-masked patches. This yields up to a 30% relative reduction in character error rate without extra annotated data.
Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of the models, in some cases by up to 30% relative. It also performs on par with transfer learning, without relying on extra annotated text lines.
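The two pre-training modifications and the relative error metric can be illustrated with a minimal sketch. The abstract does not specify the schedule shape, the masking probability range, or how masked and non-masked patches are weighted in the loss, so the linear ramp, the default values, and all function and parameter names below are assumptions made for illustration only. Per-patch loss values are taken as given inputs.

```python
import random


def masking_probability(step, total_steps, p_start=0.2, p_end=0.6):
    """Ramp the masking probability from p_start to p_end over training.

    Assumption: a linear ramp with these endpoints; the abstract only says
    the probability is increased progressively.
    """
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def sample_mask(num_patches, p, rng):
    """Mark each patch as masked independently with probability p."""
    return [rng.random() < p for _ in range(num_patches)]


def combined_loss(per_patch_losses, mask, unmasked_weight=0.1):
    """Mean loss over masked patches plus a weighted mean over visible ones.

    Assumption: the non-masked term is down-weighted by unmasked_weight;
    the actual combination used in the paper is not stated in the abstract.
    """
    masked = [v for v, m in zip(per_patch_losses, mask) if m]
    visible = [v for v, m in zip(per_patch_losses, mask) if not m]
    masked_term = sum(masked) / len(masked) if masked else 0.0
    visible_term = sum(visible) / len(visible) if visible else 0.0
    return masked_term + unmasked_weight * visible_term


def relative_cer_reduction(cer_baseline, cer_pretrained):
    """Relative reduction in character error rate, as a fraction of the baseline."""
    return (cer_baseline - cer_pretrained) / cer_baseline


if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 5
    for step in range(total_steps):
        p = masking_probability(step, total_steps)
        mask = sample_mask(8, p, rng)
        print(f"step {step}: p={p:.2f}, masked {sum(mask)} of {len(mask)} patches")
```

As a hypothetical reading of the metric, a baseline character error rate of 5.0% that drops to 3.5% after pre-training corresponds to a 30% relative reduction, although the absolute difference is only 1.5 percentage points.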