CV | AI | LG | Mar 28, 2025

Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets

arXiv:2503.22513v1 | h-index: 4 | ICDAR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving text recognition accuracy for applications such as OCR. It offers a competitive alternative to transfer learning, but its methodological contributions are incremental.

The paper improves text recognition transformers through masked self-supervised pre-training with two modifications: progressively increasing the masking probability and adjusting the loss to cover both masked and non-masked patches. The reported result is up to a 30% relative reduction in character error rate without any extra annotated data.

Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of models, in some cases by up to 30% relative. It is also on par with transfer learning, without relying on extra annotated text lines.
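
The two pre-training modifications named in the abstract can be illustrated with a minimal Python sketch. The abstract gives neither the schedule nor the loss weighting, so everything concrete here is an assumption made for illustration: the linear ramp, the start and end probabilities, the `unmasked_weight` value, and all function names are hypothetical and are not taken from the paper.

```python
import random


def masking_probability(step, total_steps, p_start=0.1, p_end=0.5):
    """Assumed linear schedule: ramp the masking probability from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return p_start + frac * (p_end - p_start)


def sample_mask(num_patches, p, rng):
    """Mask each patch independently with probability p (True means masked)."""
    return [rng.random() < p for _ in range(num_patches)]


def combined_loss(per_patch_losses, mask, unmasked_weight=0.1):
    """Weighted mean of per-patch losses.

    Masked patches get weight 1.0 and non-masked patches get unmasked_weight
    (a hypothetical value), so both groups contribute to the objective.
    """
    weights = [1.0 if m else unmasked_weight for m in mask]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(w * l for w, l in zip(weights, per_patch_losses))
    return weighted / total_weight


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50_000, 100_000):
        p = masking_probability(step, total_steps=100_000)
        mask = sample_mask(num_patches=16, p=p, rng=rng)
        # Placeholder per-patch losses; a real model would produce these.
        losses = [1.0] * 16
        print(step, round(p, 2), sum(mask), combined_loss(losses, mask))
```

Setting `unmasked_weight=0.0` in this sketch recovers a loss computed over masked patches only, which makes the effect of including non-masked patches easy to compare.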

